initial commit
31	.github/workflows/ci.yml vendored Normal file
@@ -0,0 +1,31 @@
name: CI
on: push
jobs:

  lint_and_typecheck:
    runs-on: ubuntu-latest
    steps:
      - name: checkout
        uses: actions/checkout@v3

      - name: Set up python
        id: setup-python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      - name: Install Poetry
        uses: snok/install-poetry@v1
        with:
          virtualenvs-create: true
          virtualenvs-in-project: true
          installer-parallel: true

      - name: poetry install
        run: poetry install --no-interaction --extras=training

      - name: lint
        run: poetry run ruff check .

      - name: typecheck
        run: poetry run pyright
28	.gitignore vendored Normal file
@@ -0,0 +1,28 @@
# compilation and distribution
__pycache__/
*.py[cod]
dist/

# virtual environments
venv/

# unit tests
.pytest_cache/

# tests' model weights
tests/weights/

# ruff
.ruff_cache

# vscode
.vscode

# Weights & Biases (offline trainings)
wandb/

# macos
.DS_Store

# model weights
*.safetensors
21	LICENSE Normal file
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 Lagon Technologies

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
304	README.md Normal file
@@ -0,0 +1,304 @@
<div align="center">

<picture>
  <source media="(prefers-color-scheme: dark)" srcset="assets/logo_dark.png">
  <source media="(prefers-color-scheme: light)" srcset="assets/logo_light.png">
  <img alt="Finegrain Refiners Library" width="352" height="128" style="max-width: 100%;">
</picture>

**The simplest way to train and run adapters on top of foundational models**

______________________________________________________________________

[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/refiners)](https://pypi.org/project/refiners/)
[![PyPI Status](https://badge.fury.io/py/refiners.svg)](https://badge.fury.io/py/refiners)
[![license](https://img.shields.io/badge/license-MIT-blue)](/LICENSE)
</div>

- [Motivation](#motivation)
- [Design](#design)
- [Downsides](#downsides)
- [Overview](#overview)
- [Key Concepts](#key-concepts)
  - [The Chain class](#the-chain-class)
  - [The Context API](#the-context-api)
  - [The Adapter API](#the-adapter-api)
- [Getting Started](#getting-started)
  - [Install](#install)
  - [Hello World](#hello-world)
- [Training](#training)
- [Credits](#credits)
- [Citation](#citation)

## Motivation

At [Finegrain](https://finegrain.ai), we're on a mission to automate product photography. Given our "no human in the loop" approach, nailing the quality of the outputs we generate is paramount to our success.

That's why we're building Refiners.

It's a framework to easily bridge the last-mile quality gap of foundational models like Stable Diffusion or Segment Anything Model (SAM), by adapting them to specific tasks with lightweight trainable and composable patches.

We decided to build Refiners in the open.

That's because model adaptation is a new paradigm that goes beyond our specific use cases. Our hope is to help people looking at creating their own adapters save time, whatever foundation model they're using.

## Design

We are huge fans of PyTorch (we actually were core committers to [Torch](http://torch.ch/) in another life), but we felt it was too low-level for the specific task of model adaptation: PyTorch models are generally hard to understand, and their adaptation requires intricate ad hoc code.

Instead, we needed:

- A model structure that's human-readable so that you know what models do and how they work right here, right now
- A mechanism to easily inject parameters in some target layers, or between them
- A way to easily pass data (like a conditioning input) between layers even when deeply nested
- Native support for iconic adapter types like LoRAs and their community-trained incarnations (hosted on [Civitai](http://civitai.com/) and the like)

Refiners is designed to tackle all these challenges while remaining just one abstraction away from our beloved PyTorch.

## Downsides

As they say, there is no free lunch. Since Refiners comes with a new model structure, it can only work with models implemented that way. For now, we support Stable Diffusion 1.5, but more is in the making (SDXL, SAM, ...) - stay tuned.

## Overview

The Refiners library is made of:

1. An abstraction layer (called Fluxion) on top of [PyTorch](https://pytorch.org/) to easily build models
2. A zoo of compatible foundational models
3. Adapter APIs to easily patch supported foundational models
4. Training utils to train concrete adapters
5. Conversion scripts to easily use existing community adapters

## Key Concepts

### The Chain class

The `Chain` class is at the core of Refiners. It basically lets you express models as a composition of basic layers (linear, convolution, attention, etc.) in a **declarative way**.

E.g. this is what a Vision Transformer (ViT) looks like with Refiners:

```python
import torch
import refiners.fluxion.layers as fl

class ViT(fl.Chain):
    # The Vision Transformer model structure is entirely defined in the constructor. It is
    # ready to use right after, i.e. there is no need to implement any forward function or add extra logic
    def __init__(
        self,
        embedding_dim: int = 512,
        patch_size: int = 16,
        image_size: int = 384,
        num_layers: int = 12,
        num_heads: int = 8,
    ):
        num_patches = image_size // patch_size
        super().__init__(
            fl.Conv2d(in_channels=3, out_channels=embedding_dim, kernel_size=patch_size, stride=patch_size),
            fl.Reshape(num_patches**2, embedding_dim),
            # The Residual layer implements the so-called skip connection, i.e. x + F(x).
            # Here the patch embeddings (x) are summed with the position embeddings (F(x)) whose
            # weights are stored in the Parameter layer (note: there is no extra classification
            # token in this toy example)
            fl.Residual(fl.Parameter(num_patches**2, embedding_dim)),
            # These are the transformer encoders:
            *(
                fl.Chain(
                    fl.LayerNorm(embedding_dim),
                    fl.Residual(
                        # The Parallel layer is used to pass multiple inputs to a downstream
                        # layer, here multi-headed self-attention
                        fl.Parallel(
                            fl.Identity(),
                            fl.Identity(),
                            fl.Identity(),
                        ),
                        fl.Attention(
                            embedding_dim=embedding_dim,
                            num_heads=num_heads,
                            key_embedding_dim=embedding_dim,
                            value_embedding_dim=embedding_dim,
                        ),
                    ),
                    fl.LayerNorm(embedding_dim),
                    fl.Residual(
                        fl.Linear(embedding_dim, embedding_dim * 4),
                        fl.GeLU(),
                        fl.Linear(embedding_dim * 4, embedding_dim),
                    ),
                    fl.Chain(
                        fl.Linear(embedding_dim, embedding_dim * 4),
                        fl.GeLU(),
                        fl.Linear(embedding_dim * 4, embedding_dim),
                    ),
                )
                for _ in range(num_layers)
            ),
            fl.Reshape(embedding_dim, num_patches, num_patches),
        )

vit = ViT(embedding_dim=768, image_size=224, num_heads=12)  # ~ViT-B/16 like
x = torch.randn(2, 3, 224, 224)
y = vit(x)
```
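
Since the model is just a `Chain`, its structure can also be walked programmatically. As a small sketch (it reuses the `layers` helper that also appears in the Adapter API section below):

```python
# Sketch: list every attention layer of the ViT declared above.
for attention in vit.layers(fl.Attention):
    print(attention)
```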

### The Context API

The `Chain` class has a context provider that allows you to **pass data to layers even when deeply nested**.

E.g. to implement cross-attention you would just need to modify the ViT model like in the toy example below:

```diff
@@ -21,8 +21,8 @@
                     fl.Residual(
                         fl.Parallel(
                             fl.Identity(),
-                            fl.Identity(),
-                            fl.Identity()
+                            fl.UseContext(context="cross_attention", key="my_embed"),
+                            fl.UseContext(context="cross_attention", key="my_embed"),
                         ),  # used to pass multiple inputs to a layer
                         fl.Attention(
                             embedding_dim=embedding_dim,
@@ -49,5 +49,6 @@
         )

 vit = ViT(embedding_dim=768, image_size=224, num_heads=12)  # ~ViT-B/16 like
+vit.set_context("cross_attention", {"my_embed": torch.randn(2, 196, 768)})
 x = torch.randn(2, 3, 224, 224)
 y = vit(x)
```

### The Adapter API

The `Adapter` API lets you **easily patch models** by injecting parameters in targeted layers. It comes with built-in support for canonical adapter types like LoRA, but you can also implement your own custom adapters with it.

E.g. to inject LoRA layers into all the attention layers' linear layers:

```python
from refiners.adapters.lora import LoraAdapter

for layer in vit.layers(fl.Attention):
    for linear, parent in layer.walk(fl.Linear):
        adapter = LoraAdapter(target=linear, rank=64, device=vit.device, dtype=vit.dtype)
        adapter.inject(parent)

# ... and load existing weights if the LoRAs are pretrained ...
```
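
For custom adapters, the pattern is to subclass both `fl.Chain` and `Adapter`, and to call the `Chain` constructor inside the `setup_adapter` context (see `src/refiners/adapters/adapter.py` in this commit). Here is a minimal sketch; the adapter name and the placeholder layer are illustrative assumptions, not a prescribed recipe:

```python
import refiners.fluxion.layers as fl
from refiners.adapters.adapter import Adapter

class MyAdapter(fl.Chain, Adapter[fl.Linear]):
    # Sketch of a custom adapter wrapping a Linear layer.
    def __init__(self, target: fl.Linear) -> None:
        # setup_adapter must wrap the call to the Chain constructor.
        with self.setup_adapter(target):
            super().__init__(
                target,         # the adapted layer keeps running as-is
                fl.Identity(),  # placeholder: declare extra (e.g. trainable) layers here
            )

# Hypothetical usage: patch the first linear layer found in the first attention layer.
attention = next(iter(vit.layers(fl.Attention)))
linear, parent = next(iter(attention.walk(fl.Linear)))
MyAdapter(linear).inject(parent)
```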

## Getting Started

### Install

```bash
# inference only
pip install refiners
```

Or:

```bash
# inference + training
pip install 'refiners[training]'
```

### Hello World

Here is how to perform a text-to-image inference using the Stable Diffusion 1.5 foundational model patched with a Pokemon LoRA:

Step 1: prepare the model weights in refiners' format:

```bash
python scripts/convert-clip-weights.py --output-file CLIPTextEncoderL.safetensors
python scripts/convert-sd-lda-weights.py --output-file lda.safetensors
python scripts/convert-sd-unet-weights.py --output-file unet.safetensors
```

> Note: this will download the original weights from https://huggingface.co/runwayml/stable-diffusion-v1-5 which takes some time. If you already have this repo cloned locally, use the `--from /path/to/stable-diffusion-v1-5` option instead.

Step 2: download and convert a community Pokemon LoRA, e.g. [this one](https://huggingface.co/pcuenq/pokemon-lora):

```bash
curl -LO https://huggingface.co/pcuenq/pokemon-lora/resolve/main/pytorch_lora_weights.bin
python scripts/convert-lora-weights.py \
    --from pytorch_lora_weights.bin \
    --output-file pokemon_lora.safetensors
```

Step 3: run inference using the GPU:

```python
from refiners.foundationals.latent_diffusion import StableDiffusion_1
from refiners.foundationals.latent_diffusion.lora import LoraWeights
from refiners.fluxion.utils import load_from_safetensors, manual_seed
import torch


sd15 = StableDiffusion_1(device="cuda")
sd15.clip_text_encoder.load_state_dict(load_from_safetensors("CLIPTextEncoderL.safetensors"))
sd15.lda.load_state_dict(load_from_safetensors("lda.safetensors"))
sd15.unet.load_state_dict(load_from_safetensors("unet.safetensors"))

# This uses the LoraAdapter internally and takes care to inject it where it should
lora_weights = LoraWeights("pokemon_lora.safetensors", device=sd15.device)
lora_weights.patch(sd15, scale=1.0)

prompt = "a cute cat"

with torch.no_grad():
    clip_text_embedding = sd15.compute_text_embedding(prompt)

sd15.set_num_inference_steps(30)

manual_seed(2)
x = torch.randn(1, 4, 64, 64, device=sd15.device)

with torch.no_grad():
    for step in sd15.steps:
        x = sd15(
            x,
            step=step,
            clip_text_embedding=clip_text_embedding,
            condition_scale=7.5,
        )
    predicted_image = sd15.lda.decode_latents(x)
    predicted_image.save("pokemon_cat.png")
```

You should get:

![pokemon cat output](assets/pokemon_cat.png)

## Training

Refiners has a built-in training utils library and provides scripts that can be used as a starting point.

E.g. to train a LoRA on top of Stable Diffusion, copy and edit `configs/finetune-lora.toml` to suit your needs and launch the training as follows:

```bash
python scripts/training/finetune-ldm-lora.py configs/finetune-lora.toml
```
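
For reference, the training entry point behind such a command is tiny; here is a minimal sketch mirroring `scripts/training/finetune-ldm.py` from this commit (the config path is just an example):

```python
from refiners.training_utils.latent_diffusion import FinetuneLatentDiffusionConfig, LatentDiffusionTrainer

# "configs/finetune-ldm.toml" is an example path: point it at your own edited config.
config = FinetuneLatentDiffusionConfig.load_from_toml(toml_path="configs/finetune-ldm.toml")
trainer = LatentDiffusionTrainer(config=config)
trainer.train()
```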

## Credits

We took inspiration from these great projects:

- [tinygrad](https://github.com/tinygrad/tinygrad) - For something between PyTorch and [karpathy/micrograd](https://github.com/karpathy/micrograd)
- [Composer](https://github.com/mosaicml/composer) - A PyTorch Library for Efficient Neural Network Training
- [Keras](https://github.com/keras-team/keras) - Deep Learning for humans

## Citation

```bibtex
@misc{the-finegrain-team-2023-refiners,
  author = {Benjamin Trom and Pierre Chapuis and Cédric Deltheil},
  title = {Refiners: The simplest way to train and run adapters on top of foundational models},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/finegrain-ai/refiners}}
}
```
BIN	assets/dropy.png Normal file (After Width: | Height: | Size: 4.5 KiB)
BIN	assets/logo_dark.png Normal file (After Width: | Height: | Size: 27 KiB)
BIN	assets/logo_light.png Normal file (After Width: | Height: | Size: 25 KiB)
BIN	assets/pokemon_cat.png Normal file (After Width: | Height: | Size: 336 KiB)
55	configs/finetune-ldm.toml Normal file
@@ -0,0 +1,55 @@
script = "finetune-ldm.py" # not used for now

[wandb]
mode = "offline" # "online", "offline", "disabled"
entity = "acme"
project = "test-ldm-training"

[models]
lda = {checkpoint="/path/to/stable-diffusion-1-5/lda.safetensors", train=false}
text_encoder = {checkpoint="/path/to/stable-diffusion-1-5/text_encoder.safetensors", train=true}
unet = {checkpoint="/path/to/stable-diffusion-1-5/unet.safetensors", train=true}

[latent_diffusion]
unconditional_sampling_probability = 0.2
offset_noise = 0.1

[training]
duration = "1:epoch"
seed = 0
gpu_index = 0
num_epochs = 1
batch_size = 1
gradient_accumulation = "1:step"
clip_grad_norm = 2.0
clip_grad_value = 1.0
evaluation_interval = "1:epoch"
evaluation_seed = 0


[optimizer]
optimizer = "AdamW" # "AdamW", "AdamW8bit", "Lion8bit", "Prodigy", "SGD", "Adam"
learning_rate = 1e-5
betas = [0.9, 0.999]
eps = 1e-8
weight_decay = 1e-2


[scheduler]


[dropout]
dropout_probability = 0.2

[dataset]
hf_repo = "acme/images"
revision = "main"

[checkpointing]
# save_folder = "/path/to/ckpts"
save_interval = "1:epoch"

[test_diffusion]
prompts = [
    "A cute cat",
]
70	configs/finetune-lora.toml Normal file
@@ -0,0 +1,70 @@
script = "finetune-ldm-lora.py" # not used for now

[wandb]
mode = "offline" # "online", "offline", "disabled"
entity = "acme"
project = "test-lora-training"

[models]
unet = {checkpoint = "/path/to/stable-diffusion-1-5/unet.safetensors"}
text_encoder = {checkpoint = "/path/to/stable-diffusion-1-5/CLIPTextEncoderL.safetensors"}
lda = {checkpoint = "/path/to/stable-diffusion-1-5/lda.safetensors"}

[latent_diffusion]
unconditional_sampling_probability = 0.05
offset_noise = 0.1

[lora]
rank = 16
trigger_phrase = "a spsh photo,"
use_only_trigger_probability = 1.0
unet_targets = ["CrossAttentionBlock2d"]
text_encoder_targets = ["TransformerLayer"]
lda_targets = []

[training]
duration = "1000:epoch"
seed = 0
gpu_index = 0
batch_size = 4
gradient_accumulation = "4:step"
clip_grad_norm = 1.0
# clip_grad_value = 1.0
evaluation_interval = "5:epoch"
evaluation_seed = 1


[optimizer]
optimizer = "Prodigy" # "SGD", "Adam", "AdamW", "AdamW8bit", "Lion8bit"
learning_rate = 1
betas = [0.9, 0.999]
eps = 1e-8
weight_decay = 1e-2

[scheduler]
scheduler_type = "ConstantLR"
update_interval = "1:step"
warmup = "500:step"


[dropout]
dropout_probability = 0.2
use_gyro_dropout = false

[dataset]
hf_repo = "acme/images"
revision = "main"

[checkpointing]
# save_folder = "/path/to/ckpts"
save_interval = "1:step"

[test_diffusion]
num_inference_steps = 30
use_short_prompts = false
prompts = [
    "a cute cat",
    "a cute dog",
    "a cute bird",
    "a cute horse",
]
3360	poetry.lock generated Normal file
67	pyproject.toml Normal file
@@ -0,0 +1,67 @@
[tool.poetry]
name = "refiners"
version = "0.1.0"
description = "The simplest way to train and run adapters on top of foundational models"
authors = [
    "The Finegrain Team <bonjour@lagon.tech>",
]
license = "MIT"
readme = "README.md"
packages = [{include = "refiners", from = "src"}]

[tool.poetry.dependencies]
python = ">=3.10,<3.12"
jaxtyping = "^0.2.14"
torch = "^2.0.0"
safetensors = "^0.3.0"
numpy = "^1.24.2"
pillow = "^9.5.0"
datasets = {version = "^2.14.0", optional = true}
tomli = {version = "^2.0.1", optional = true}
wandb = {version = "^0.15.7", optional = true}
loguru = {version = "^0.7.0", optional = true}
bitsandbytes = {version = "^0.41.0", optional = true}
prodigyopt = {version = "^1.0", optional = true}
pydantic = {git = "https://github.com/pydantic/pydantic.git", rev = "v2.0b3", optional = true}
scipy = {version = "^1.11.1", optional = true}


[tool.poetry.extras]
training = ["datasets", "tomli", "wandb", "loguru", "bitsandbytes", "prodigyopt", "pydantic", "scipy"]

[tool.poetry.group.dev.dependencies]
black = "^23.1.0"
pytest = "^7.2.2"
isort = "^5.12.0"
ipykernel = "^6.22.0"
pyright = "^1.1.318"
ruff = "^0.0.281"


[tool.poetry.group.test.dependencies]
diffusers = "^0.18.0"
transformers = "^4.27.4"
piq = "^0.7.1"
invisible-watermark = "^0.2.0"


[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

[tool.black]
line-length = 120
preview = true

[tool.ruff]
ignore = [
    "F722", # forward-annotation-syntax-error, because of Jaxtyping
    "E731", # do-not-assign-lambda
    "E501", # line-too-long, because Black (https://beta.ruff.rs/docs/faq/#is-ruff-compatible-with-black)
]
line-length = 120

[tool.pyright]
include = ["src/refiners", "tests", "scripts/training"]
exclude = ["**/__pycache__"]
reportMissingTypeStubs = "warning"
50	scripts/convert-clip-weights.py Normal file
@@ -0,0 +1,50 @@
import torch

from safetensors.torch import save_file
from refiners.fluxion.utils import (
    create_state_dict_mapping,
    convert_state_dict,
)

from diffusers import DiffusionPipeline
from transformers.models.clip.modeling_clip import CLIPTextModel

from refiners.foundationals.clip.text_encoder import CLIPTextEncoderL


@torch.no_grad()
def convert(src_model: CLIPTextModel) -> dict[str, torch.Tensor]:
    dst_model = CLIPTextEncoderL()
    x = dst_model.tokenizer("Nice cat", sequence_length=77)
    mapping = create_state_dict_mapping(src_model, dst_model, [x])
    state_dict = convert_state_dict(src_model.state_dict(), dst_model.state_dict(), mapping)
    return {k: v.half() for k, v in state_dict.items()}


def main():
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--from",
        type=str,
        dest="source",
        required=False,
        default="runwayml/stable-diffusion-v1-5",
        help="Source model",
    )
    parser.add_argument(
        "--output-file",
        type=str,
        required=False,
        default="CLIPTextEncoderL.safetensors",
        help="Path for the output file",
    )
    args = parser.parse_args()
    src_model = DiffusionPipeline.from_pretrained(args.source).text_encoder
    tensors = convert(src_model)
    save_file(tensors, args.output_file)


if __name__ == "__main__":
    main()
203	scripts/convert-controlnet-weights.py Normal file
@@ -0,0 +1,203 @@
import torch
from diffusers import ControlNetModel
from safetensors.torch import save_file
from refiners.fluxion.utils import (
    forward_order_of_execution,
    verify_shape_match,
    convert_state_dict,
)
from refiners.foundationals.latent_diffusion.controlnet import Controlnet
from refiners.foundationals.latent_diffusion.schedulers.dpm_solver import DPMSolver
from refiners.foundationals.latent_diffusion import UNet


@torch.no_grad()
def convert(controlnet_src: ControlNetModel) -> dict[str, torch.Tensor]:
    controlnet = Controlnet(name="mycn")

    condition = torch.randn(1, 3, 512, 512)
    controlnet.set_controlnet_condition(condition)

    unet = UNet(4, clip_embedding_dim=768)
    unet.insert(0, controlnet)
    clip_text_embedding = torch.rand(1, 77, 768)
    unet.set_clip_text_embedding(clip_text_embedding)

    scheduler = DPMSolver(num_inference_steps=10)
    timestep = scheduler.timesteps[0].unsqueeze(0)
    unet.set_timestep(timestep.unsqueeze(0))

    x = torch.randn(1, 4, 64, 64)

    # We need the hack below because our implementation is not strictly equivalent
    # to diffusers in order, since we compute the residuals inline instead of
    # in a separate step.

    source_order = forward_order_of_execution(controlnet_src, (x, timestep, clip_text_embedding, condition))
    target_order = forward_order_of_execution(controlnet, (x,))

    broken_k = ("Conv2d", (torch.Size([320, 320, 1, 1]), torch.Size([320])))

    expected_source_order = [
        "down_blocks.0.attentions.0.proj_in",
        "down_blocks.0.attentions.0.proj_out",
        "down_blocks.0.attentions.1.proj_in",
        "down_blocks.0.attentions.1.proj_out",
        "controlnet_down_blocks.0",
        "controlnet_down_blocks.1",
        "controlnet_down_blocks.2",
        "controlnet_down_blocks.3",
    ]

    expected_target_order = [
        "DownBlocks.Chain_1.Passthrough.Conv2d",
        "DownBlocks.Chain_2.CLIPLCrossAttention.Chain.Chain_1.Conv2d",
        "DownBlocks.Chain_2.CLIPLCrossAttention.Chain.Chain_3.Conv2d",
        "DownBlocks.Chain_2.Passthrough.Conv2d",
        "DownBlocks.Chain_3.CLIPLCrossAttention.Chain.Chain_1.Conv2d",
        "DownBlocks.Chain_3.CLIPLCrossAttention.Chain.Chain_3.Conv2d",
        "DownBlocks.Chain_3.Passthrough.Conv2d",
        "DownBlocks.Chain_4.Passthrough.Conv2d",
    ]

    fixed_source_order = [
        "controlnet_down_blocks.0",
        "down_blocks.0.attentions.0.proj_in",
        "down_blocks.0.attentions.0.proj_out",
        "controlnet_down_blocks.1",
        "down_blocks.0.attentions.1.proj_in",
        "down_blocks.0.attentions.1.proj_out",
        "controlnet_down_blocks.2",
        "controlnet_down_blocks.3",
    ]

    assert source_order[broken_k] == expected_source_order
    assert target_order[broken_k] == expected_target_order
    source_order[broken_k] = fixed_source_order

    broken_k = ("Conv2d", (torch.Size([640, 640, 1, 1]), torch.Size([640])))

    expected_source_order = [
        "down_blocks.1.attentions.0.proj_in",
        "down_blocks.1.attentions.0.proj_out",
        "down_blocks.1.attentions.1.proj_in",
        "down_blocks.1.attentions.1.proj_out",
        "controlnet_down_blocks.4",
        "controlnet_down_blocks.5",
        "controlnet_down_blocks.6",
    ]

    expected_target_order = [
        "DownBlocks.Chain_5.CLIPLCrossAttention.Chain.Chain_1.Conv2d",
        "DownBlocks.Chain_5.CLIPLCrossAttention.Chain.Chain_3.Conv2d",
        "DownBlocks.Chain_5.Passthrough.Conv2d",
        "DownBlocks.Chain_6.CLIPLCrossAttention.Chain.Chain_1.Conv2d",
        "DownBlocks.Chain_6.CLIPLCrossAttention.Chain.Chain_3.Conv2d",
        "DownBlocks.Chain_6.Passthrough.Conv2d",
        "DownBlocks.Chain_7.Passthrough.Conv2d",
    ]

    fixed_source_order = [
        "down_blocks.1.attentions.0.proj_in",
        "down_blocks.1.attentions.0.proj_out",
        "controlnet_down_blocks.4",
        "down_blocks.1.attentions.1.proj_in",
        "down_blocks.1.attentions.1.proj_out",
        "controlnet_down_blocks.5",
        "controlnet_down_blocks.6",
    ]

    assert source_order[broken_k] == expected_source_order
    assert target_order[broken_k] == expected_target_order
    source_order[broken_k] = fixed_source_order

    broken_k = ("Conv2d", (torch.Size([1280, 1280, 1, 1]), torch.Size([1280])))

    expected_source_order = [
        "down_blocks.2.attentions.0.proj_in",
        "down_blocks.2.attentions.0.proj_out",
        "down_blocks.2.attentions.1.proj_in",
        "down_blocks.2.attentions.1.proj_out",
        "mid_block.attentions.0.proj_in",
        "mid_block.attentions.0.proj_out",
        "controlnet_down_blocks.7",
        "controlnet_down_blocks.8",
        "controlnet_down_blocks.9",
        "controlnet_down_blocks.10",
        "controlnet_down_blocks.11",
        "controlnet_mid_block",
    ]

    expected_target_order = [
        "DownBlocks.Chain_8.CLIPLCrossAttention.Chain.Chain_1.Conv2d",
        "DownBlocks.Chain_8.CLIPLCrossAttention.Chain.Chain_3.Conv2d",
        "DownBlocks.Chain_8.Passthrough.Conv2d",
        "DownBlocks.Chain_9.CLIPLCrossAttention.Chain.Chain_1.Conv2d",
        "DownBlocks.Chain_9.CLIPLCrossAttention.Chain.Chain_3.Conv2d",
        "DownBlocks.Chain_9.Passthrough.Conv2d",
        "DownBlocks.Chain_10.Passthrough.Conv2d",
        "DownBlocks.Chain_11.Passthrough.Conv2d",
        "DownBlocks.Chain_12.Passthrough.Conv2d",
        "MiddleBlock.CLIPLCrossAttention.Chain.Chain_1.Conv2d",
        "MiddleBlock.CLIPLCrossAttention.Chain.Chain_3.Conv2d",
        "MiddleBlock.Passthrough.Conv2d",
    ]

    fixed_source_order = [
        "down_blocks.2.attentions.0.proj_in",
        "down_blocks.2.attentions.0.proj_out",
        "controlnet_down_blocks.7",
        "down_blocks.2.attentions.1.proj_in",
        "down_blocks.2.attentions.1.proj_out",
        "controlnet_down_blocks.8",
        "controlnet_down_blocks.9",
        "controlnet_down_blocks.10",
        "controlnet_down_blocks.11",
        "mid_block.attentions.0.proj_in",
        "mid_block.attentions.0.proj_out",
        "controlnet_mid_block",
    ]

    assert source_order[broken_k] == expected_source_order
    assert target_order[broken_k] == expected_target_order
    source_order[broken_k] = fixed_source_order

    assert verify_shape_match(source_order, target_order)

    mapping: dict[str, str] = {}
    for model_type_shape in source_order:
        source_keys = source_order[model_type_shape]
        target_keys = target_order[model_type_shape]
        mapping.update(zip(target_keys, source_keys))

    state_dict = convert_state_dict(controlnet_src.state_dict(), controlnet.state_dict(), state_dict_mapping=mapping)

    return {k: v.half() for k, v in state_dict.items()}


def main():
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--from",
        type=str,
        dest="source",
        required=True,
        help="Source model",
    )
    parser.add_argument(
        "--output-file",
        type=str,
        required=False,
        default="output.safetensors",
        help="Path for the output file",
    )
    args = parser.parse_args()
    controlnet_src = ControlNetModel.from_pretrained(args.source)
    tensors = convert(controlnet_src)
    save_file(tensors, args.output_file)


if __name__ == "__main__":
    main()
115	scripts/convert-lora-weights.py Normal file
@@ -0,0 +1,115 @@
# Note: this conversion script currently only supports simple LoRAs which adapt
# the UNet's attentions such as https://huggingface.co/pcuenq/pokemon-lora

import torch
from torch.nn.init import zeros_
from torch.nn import Parameter as TorchParameter

import refiners.fluxion.layers as fl

from refiners.fluxion.utils import save_to_safetensors
from refiners.foundationals.latent_diffusion.unet import UNet
from refiners.foundationals.latent_diffusion.lora import LoraTarget, apply_loras_to_target
from refiners.adapters.lora import Lora
from refiners.fluxion.utils import create_state_dict_mapping

from diffusers import DiffusionPipeline


def get_weight(linear: fl.Linear) -> torch.Tensor:
    assert linear.bias is None
    return linear.state_dict()["weight"]


def build_loras_safetensors(module: fl.Chain, key_prefix: str) -> dict[str, torch.Tensor]:
    weights: list[torch.Tensor] = []
    for lora in module.layers(layer_type=Lora):
        linears = list(lora.layers(fl.Linear))
        assert len(linears) == 2
        weights.extend((get_weight(linears[1]), get_weight(linears[0])))  # aka (up_weight, down_weight)
    return {f"{key_prefix}{i:03d}": w for i, w in enumerate(weights)}


@torch.no_grad()
def process(source: str, base_model: str, output_file: str) -> None:
    diffusers_state_dict = torch.load(source, map_location="cpu")  # type: ignore
    diffusers_sd = DiffusionPipeline.from_pretrained(base_model)  # type: ignore
    diffusers_model = diffusers_sd.unet

    refiners_model = UNet(in_channels=4, clip_embedding_dim=768)
    target = LoraTarget.CrossAttention
    metadata = {"unet_targets": "CrossAttentionBlock2d"}
    rank = diffusers_state_dict[
        "mid_block.attentions.0.transformer_blocks.0.attn1.processor.to_q_lora.down.weight"
    ].shape[0]

    x = torch.randn(1, 4, 32, 32)
    timestep = torch.tensor([0])
    clip_text_embeddings = torch.randn(1, 77, 768)

    refiners_model.set_timestep(timestep)
    refiners_model.set_clip_text_embedding(clip_text_embeddings)
    refiners_args = (x,)

    diffusers_args = (x, timestep, clip_text_embeddings)

    diffusers_to_refiners = create_state_dict_mapping(refiners_model, diffusers_model, refiners_args, diffusers_args)
    assert diffusers_to_refiners

    apply_loras_to_target(refiners_model, target=LoraTarget(target), rank=rank, scale=1.0)
    for layer in refiners_model.layers(layer_type=Lora):
        zeros_(layer.Linear_1.weight)

    targets = {k.split("_lora.")[0] for k in diffusers_state_dict.keys()}
    for target_k in targets:
        k_p, k_s = target_k.split(".processor.")
        r = [v for k, v in diffusers_to_refiners.items() if k.startswith(f"{k_p}.{k_s}")]
        assert len(r) == 1
        orig_k = r[0]
        orig_path = orig_k.split(".")
        p = refiners_model
        for seg in orig_path[:-1]:
            p = p[seg]
        last_seg = (
            "LoraAdapter" if orig_path[-1] == "Linear" else f"LoraAdapter_{orig_path[-1].removeprefix('Linear_')}"
        )
        p_down = TorchParameter(diffusers_state_dict[f"{target_k}_lora.down.weight"])
        p_up = TorchParameter(diffusers_state_dict[f"{target_k}_lora.up.weight"])
        p[last_seg].Lora.load_weights(p_down, p_up)

    state_dict = build_loras_safetensors(refiners_model, key_prefix="unet.")
    assert len(state_dict) == 320
    save_to_safetensors(output_file, tensors=state_dict, metadata=metadata)


def main():
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--from",
        type=str,
        dest="source",
        required=True,
        help="Source file path (.bin)",
    )
    parser.add_argument(
        "--base-model",
        type=str,
        required=False,
        default="runwayml/stable-diffusion-v1-5",
        help="Base model",
    )
    parser.add_argument(
        "--output-file",
        type=str,
        required=False,
        default="output.safetensors",
        help="Path for the output file",
    )
    args = parser.parse_args()
    process(source=args.source, base_model=args.base_model, output_file=args.output_file)


if __name__ == "__main__":
    main()
134	scripts/convert-loras-to-sdwebui.py Normal file
@@ -0,0 +1,134 @@
from refiners.fluxion.utils import load_from_safetensors, load_metadata_from_safetensors, save_to_safetensors
from refiners.foundationals.clip.text_encoder import CLIPTextEncoderL
from refiners.foundationals.latent_diffusion.unet import UNet
from refiners.foundationals.latent_diffusion.lora import LoraTarget
from refiners.fluxion.layers.module import Module
import refiners.fluxion.layers as fl
from refiners.fluxion.utils import create_state_dict_mapping

import torch

from diffusers import DiffusionPipeline
from diffusers.models.unet_2d_condition import UNet2DConditionModel
from transformers.models.clip.modeling_clip import CLIPTextModel


@torch.no_grad()
def create_unet_mapping(src_model: UNet2DConditionModel, dst_model: UNet) -> dict[str, str] | None:
    x = torch.randn(1, 4, 32, 32)
    timestep = torch.tensor([0])
    clip_text_embeddings = torch.randn(1, 77, 768)

    src_args = (x, timestep, clip_text_embeddings)
    dst_model.set_timestep(timestep)
    dst_model.set_clip_text_embedding(clip_text_embeddings)
    dst_args = (x,)

    return create_state_dict_mapping(src_model, dst_model, src_args, dst_args)  # type: ignore


@torch.no_grad()
def create_text_encoder_mapping(src_model: CLIPTextModel, dst_model: CLIPTextEncoderL) -> dict[str, str] | None:
    x = dst_model.tokenizer("Nice cat", sequence_length=77)

    return create_state_dict_mapping(src_model, dst_model, [x])  # type: ignore


def main():
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-i",
        "--input-file",
        type=str,
        required=True,
        help="Path to the input file with refiners' LoRA weights (safetensors format)",
    )
    parser.add_argument(
        "-o",
        "--output-file",
        type=str,
        required=True,
        help="Path to the output file with sd-webui's LoRA weights (safetensors format)",
    )
    parser.add_argument(
        "--sd15",
        type=str,
        required=False,
        default="runwayml/stable-diffusion-v1-5",
        help="Path (preferred) or repository ID of Stable Diffusion 1.5 model (Hugging Face diffusers format)",
    )
    args = parser.parse_args()

    metadata = load_metadata_from_safetensors(args.input_file)
    assert metadata is not None
    tensors = load_from_safetensors(args.input_file)

    diffusers_sd = DiffusionPipeline.from_pretrained(args.sd15)  # type: ignore

    state_dict: dict[str, torch.Tensor] = {}

    for meta_key, meta_value in metadata.items():
        match meta_key:
            case "unet_targets":
                src_model = diffusers_sd.unet  # type: ignore
                dst_model = UNet(in_channels=4, clip_embedding_dim=768)
                create_mapping = create_unet_mapping
                key_prefix = "unet."
                lora_prefix = "lora_unet_"
            case "text_encoder_targets":
                src_model = diffusers_sd.text_encoder  # type: ignore
                dst_model = CLIPTextEncoderL()
                create_mapping = create_text_encoder_mapping
                key_prefix = "text_encoder."
                lora_prefix = "lora_te_"
            case "lda_targets":
                raise ValueError("SD-WebUI does not support LoRA for the auto-encoder")
            case _:
                raise ValueError(f"Unexpected key in checkpoint metadata: {meta_key}")

        submodule_to_key: dict[Module, str] = {}
        for name, submodule in dst_model.named_modules():
            submodule_to_key[submodule] = name

        # SD-WebUI expects LoRA state dicts with keys derived from the diffusers format, e.g.:
        #
        #     lora_unet_down_blocks_0_attentions_0_proj_in.alpha
        #     lora_unet_down_blocks_0_attentions_0_proj_in.lora_down.weight
        #     lora_unet_down_blocks_0_attentions_0_proj_in.lora_up.weight
        #     ...
        #
        # Internally SD-WebUI has some logic[1] to convert such keys into the CompVis format. See
        # `convert_diffusers_name_to_compvis` for more details.
        #
        # [1]: https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/394ffa7/extensions-builtin/Lora/lora.py#L158-L225

        refiners_to_diffusers = create_mapping(src_model, dst_model)  # type: ignore
        assert refiners_to_diffusers is not None

        # Compute the corresponding diffusers' keys where LoRA layers must be applied
        lora_injection_points: list[str] = [
            refiners_to_diffusers[submodule_to_key[linear]]
            for target in [LoraTarget(t) for t in meta_value.split(",")]
            for layer in dst_model.layers(layer_type=target.get_class())
            for linear in layer.layers(fl.Linear)
        ]

        lora_weights = [w for w in [tensors[k] for k in sorted(tensors) if k.startswith(key_prefix)]]
        assert len(lora_injection_points) == len(lora_weights) // 2

        # Map LoRA weights to each key using SD-WebUI conventions (proper prefix and suffix, underscores)
        for i, diffusers_key in enumerate(lora_injection_points):
            lora_key = lora_prefix + diffusers_key.replace(".", "_")
            # Note: no ".alpha" weights (those are used to scale the LoRA by alpha/rank). Refiners uses a scale = 1.0
            # by default (see `lora_calc_updown` in SD-WebUI for more details)
            state_dict[lora_key + ".lora_up.weight"] = lora_weights[2 * i]
            state_dict[lora_key + ".lora_down.weight"] = lora_weights[2 * i + 1]

    assert state_dict
    save_to_safetensors(args.output_file, state_dict)


if __name__ == "__main__":
    main()
50	scripts/convert-sd-lda-weights.py Normal file
@@ -0,0 +1,50 @@
import torch

from safetensors.torch import save_file
from refiners.fluxion.utils import (
    create_state_dict_mapping,
    convert_state_dict,
)

from diffusers import DiffusionPipeline
from diffusers.models.autoencoder_kl import AutoencoderKL

from refiners.foundationals.latent_diffusion.auto_encoder import LatentDiffusionAutoencoder


@torch.no_grad()
def convert(src_model: AutoencoderKL) -> dict[str, torch.Tensor]:
    dst_model = LatentDiffusionAutoencoder()
    x = torch.randn(1, 3, 512, 512)
    mapping = create_state_dict_mapping(src_model, dst_model, [x])
    state_dict = convert_state_dict(src_model.state_dict(), dst_model.state_dict(), mapping)
    return {k: v.half() for k, v in state_dict.items()}


def main():
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--from",
        type=str,
        dest="source",
        required=False,
        default="runwayml/stable-diffusion-v1-5",
        help="Source model",
    )
    parser.add_argument(
        "--output-file",
        type=str,
        required=False,
        default="lda.safetensors",
        help="Path for the output file",
    )
    args = parser.parse_args()
    src_model = DiffusionPipeline.from_pretrained(args.source).vae
    tensors = convert(src_model)
    save_file(tensors, args.output_file)


if __name__ == "__main__":
    main()
59	scripts/convert-sd-unet-inpainting-weights.py Normal file
@@ -0,0 +1,59 @@
import torch

from safetensors.torch import save_file
from refiners.fluxion.utils import (
    create_state_dict_mapping,
    convert_state_dict,
)

from diffusers import StableDiffusionInpaintPipeline
from diffusers.models.unet_2d_condition import UNet2DConditionModel

from refiners.foundationals.latent_diffusion.unet import UNet


@torch.no_grad()
def convert(src_model: UNet2DConditionModel) -> dict[str, torch.Tensor]:
    dst_model = UNet(in_channels=9, clip_embedding_dim=768)

    x = torch.randn(1, 9, 32, 32)
    timestep = torch.tensor([0])
    clip_text_embeddings = torch.randn(1, 77, 768)

    src_args = (x, timestep, clip_text_embeddings)
    dst_model.set_timestep(timestep)
    dst_model.set_clip_text_embedding(clip_text_embeddings)
    dst_args = (x,)

    mapping = create_state_dict_mapping(src_model, dst_model, src_args, dst_args)
    state_dict = convert_state_dict(src_model.state_dict(), dst_model.state_dict(), mapping)
    return {k: v.half() for k, v in state_dict.items()}


def main():
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--from",
        type=str,
        dest="source",
        required=False,
        default="runwayml/stable-diffusion-inpainting",
        help="Source model",
    )
    parser.add_argument(
        "--output-file",
        type=str,
        required=False,
        default="stable_diffusion_1_5_inpainting_unet.safetensors",
        help="Path for the output file",
    )
    args = parser.parse_args()
    src_model = StableDiffusionInpaintPipeline.from_pretrained(args.source).unet
    tensors = convert(src_model)
    save_file(tensors, args.output_file)


if __name__ == "__main__":
    main()
59	scripts/convert-sd-unet-weights.py Normal file
@@ -0,0 +1,59 @@
import torch

from safetensors.torch import save_file
from refiners.fluxion.utils import (
    create_state_dict_mapping,
    convert_state_dict,
)

from diffusers import DiffusionPipeline
from diffusers.models.unet_2d_condition import UNet2DConditionModel

from refiners.foundationals.latent_diffusion.unet import UNet


@torch.no_grad()
def convert(src_model: UNet2DConditionModel) -> dict[str, torch.Tensor]:
    dst_model = UNet(in_channels=4, clip_embedding_dim=768)

    x = torch.randn(1, 4, 32, 32)
    timestep = torch.tensor([0])
    clip_text_embeddings = torch.randn(1, 77, 768)

    src_args = (x, timestep, clip_text_embeddings)
    dst_model.set_timestep(timestep)
    dst_model.set_clip_text_embedding(clip_text_embeddings)
    dst_args = (x,)

    mapping = create_state_dict_mapping(src_model, dst_model, src_args, dst_args)
    state_dict = convert_state_dict(src_model.state_dict(), dst_model.state_dict(), mapping)
    return {k: v.half() for k, v in state_dict.items()}


def main():
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--from",
        type=str,
        dest="source",
        required=False,
        default="runwayml/stable-diffusion-v1-5",
        help="Source model",
    )
    parser.add_argument(
        "--output-file",
        type=str,
        required=False,
        default="stable_diffusion_1_5_unet.safetensors",
        help="Path for the output file",
    )
    args = parser.parse_args()
    src_model = DiffusionPipeline.from_pretrained(args.source).unet
    tensors = convert(src_model)
    save_file(tensors, args.output_file)


if __name__ == "__main__":
    main()
57	scripts/convert-sdxl-text-encoder-2.py Normal file
@@ -0,0 +1,57 @@
import torch

from safetensors.torch import save_file  # type: ignore
from refiners.fluxion.utils import (
    create_state_dict_mapping,
    convert_state_dict,
)

from diffusers import DiffusionPipeline  # type: ignore
from transformers.models.clip.modeling_clip import CLIPTextModel  # type: ignore

from refiners.foundationals.clip.text_encoder import CLIPTextEncoderG
import refiners.fluxion.layers as fl


@torch.no_grad()
def convert(src_model: CLIPTextModel) -> dict[str, torch.Tensor]:
    dst_model = CLIPTextEncoderG()
    # Extra projection layer (see CLIPTextModelWithProjection in transformers)
    dst_model.append(module=fl.Linear(in_features=1280, out_features=1280, bias=False))
    x = dst_model.tokenizer("Nice cat", sequence_length=77)
    mapping = create_state_dict_mapping(source_model=src_model, target_model=dst_model, source_args=[x])  # type: ignore
    if mapping is None:
        raise RuntimeError("Could not create state dict mapping")
    state_dict = convert_state_dict(
        source_state_dict=src_model.state_dict(), target_state_dict=dst_model.state_dict(), state_dict_mapping=mapping
    )
    return {k: v.half() for k, v in state_dict.items()}


def main():
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--from",
        type=str,
        dest="source",
        required=False,
        default="stabilityai/stable-diffusion-xl-base-0.9",
        help="Source model",
    )
    parser.add_argument(
        "--output-file",
        type=str,
        required=False,
        default="CLIPTextEncoderG.safetensors",
        help="Path for the output file",
    )
    args = parser.parse_args()
    src_model = DiffusionPipeline.from_pretrained(pretrained_model_name_or_path=args.source).text_encoder_2  # type: ignore
    tensors = convert(src_model=src_model)
    save_file(tensors=tensors, filename=args.output_file)


if __name__ == "__main__":
    main()
68	scripts/convert-sdxl-unet-weights.py Normal file
@@ -0,0 +1,68 @@
import torch

from safetensors.torch import save_file  # type: ignore
from refiners.fluxion.utils import (
    create_state_dict_mapping,
    convert_state_dict,
)

from diffusers import DiffusionPipeline  # type: ignore
from diffusers.models.unet_2d_condition import UNet2DConditionModel  # type: ignore

from refiners.foundationals.latent_diffusion.sdxl_unet import SDXLUNet


@torch.no_grad()
def convert(src_model: UNet2DConditionModel) -> dict[str, torch.Tensor]:
    dst_model = SDXLUNet(in_channels=4)

    x = torch.randn(1, 4, 32, 32)
    timestep = torch.tensor([0])
    clip_text_embeddings = torch.randn(1, 77, 2048)

    added_cond_kwargs = {"text_embeds": torch.randn(1, 1280), "time_ids": torch.randn(1, 6)}
    src_args = (x, timestep, clip_text_embeddings, None, None, None, None, added_cond_kwargs)
    dst_model.set_timestep(timestep=timestep)
    dst_model.set_clip_text_embedding(clip_text_embedding=clip_text_embeddings)
    dst_model.set_time_ids(time_ids=added_cond_kwargs["time_ids"])
    dst_model.set_pooled_text_embedding(pooled_text_embedding=added_cond_kwargs["text_embeds"])
    dst_args = (x,)

    mapping = create_state_dict_mapping(
        source_model=src_model, target_model=dst_model, source_args=src_args, target_args=dst_args  # type: ignore
    )
    if mapping is None:
        raise RuntimeError("Could not create state dict mapping")
    state_dict = convert_state_dict(
        source_state_dict=src_model.state_dict(), target_state_dict=dst_model.state_dict(), state_dict_mapping=mapping
    )
    return {k: v for k, v in state_dict.items()}


def main() -> None:
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--from",
        type=str,
        dest="source",
        required=False,
        default="stabilityai/stable-diffusion-xl-base-0.9",
        help="Source model",
    )
    parser.add_argument(
        "--output-file",
        type=str,
        required=False,
        default="stable_diffusion_xl_unet.safetensors",
        help="Path for the output file",
    )
    args = parser.parse_args()
    src_model = DiffusionPipeline.from_pretrained(pretrained_model_name_or_path=args.source).unet  # type: ignore
    tensors = convert(src_model)
    save_file(tensors, args.output_file)


if __name__ == "__main__":
    main()
148
scripts/training/finetune-ldm-lora.py
Normal file
|
@ -0,0 +1,148 @@
|
|||
import random
|
||||
from typing import Any
|
||||
from pydantic import BaseModel
|
||||
from loguru import logger
|
||||
from refiners.adapters.lora import LoraAdapter, Lora
|
||||
from refiners.fluxion.utils import save_to_safetensors
|
||||
from refiners.foundationals.latent_diffusion.lora import LoraTarget
|
||||
import refiners.fluxion.layers as fl
|
||||
from torch import Tensor
|
||||
from torch.utils.data import Dataset
|
||||
|
||||
from refiners.training_utils.callback import Callback
|
||||
from refiners.training_utils.latent_diffusion import (
|
||||
FinetuneLatentDiffusionConfig,
|
||||
TextEmbeddingLatentsBatch,
|
||||
TextEmbeddingLatentsDataset,
|
||||
LatentDiffusionTrainer,
|
||||
LatentDiffusionConfig,
|
||||
)
|
||||
|
||||
|
||||
class LoraConfig(BaseModel):
|
||||
rank: int = 32
|
||||
trigger_phrase: str = ""
|
||||
use_only_trigger_probability: float = 0.0
|
||||
unet_targets: list[LoraTarget]
|
||||
text_encoder_targets: list[LoraTarget]
|
||||
lda_targets: list[LoraTarget]
|
||||
|
||||
def apply_loras_to_target(self, module: fl.Chain, target: LoraTarget) -> None:
|
||||
for layer in module.layers(layer_type=target.get_class()):
|
||||
for linear, parent in layer.walk(fl.Linear):
|
||||
adapter = LoraAdapter(
|
||||
target=linear,
|
||||
rank=self.rank,
|
||||
device=module.device,
|
||||
dtype=module.dtype,
|
||||
)
|
||||
adapter.inject(parent)
|
||||
for linear in adapter.Lora.layers(fl.Linear):
|
||||
linear.requires_grad_(requires_grad=True)
|
||||
|
||||
|
||||
class TriggerPhraseDataset(TextEmbeddingLatentsDataset):
|
||||
def __init__(
|
||||
self,
|
||||
trainer: "LoraLatentDiffusionTrainer",
|
||||
) -> None:
|
||||
super().__init__(trainer=trainer)
|
||||
self.trigger_phrase = trainer.config.lora.trigger_phrase
|
||||
self.use_only_trigger_probability = trainer.config.lora.use_only_trigger_probability
|
||||
logger.info(f"Trigger phrase: {self.trigger_phrase}")
|
||||
|
||||
def process_caption(self, caption: str) -> str:
|
||||
caption = super().process_caption(caption=caption)
|
||||
if self.trigger_phrase:
|
||||
caption = (
|
||||
f"{self.trigger_phrase} {caption}"
|
||||
if random.random() < self.use_only_trigger_probability
|
||||
else self.trigger_phrase
|
||||
)
|
||||
return caption
|
||||
|
||||
|
||||
class LoraLatentDiffusionConfig(FinetuneLatentDiffusionConfig):
|
||||
latent_diffusion: LatentDiffusionConfig
|
||||
lora: LoraConfig
|
||||
|
||||
def model_post_init(self, __context: Any) -> None:
|
||||
"""Pydantic v2 does post init differently, so we need to override this method too."""
|
||||
logger.info("Freezing models to train only the loras.")
|
||||
self.models["unet"].train = False
|
||||
self.models["text_encoder"].train = False
|
||||
self.models["lda"].train = False
|
||||
|
||||
|
||||
class LoraLatentDiffusionTrainer(LatentDiffusionTrainer[LoraLatentDiffusionConfig]):
|
||||
def __init__(
|
||||
self,
|
||||
config: LoraLatentDiffusionConfig,
|
||||
callbacks: "list[Callback[Any]] | None" = None,
|
||||
) -> None:
|
||||
super().__init__(config=config, callbacks=callbacks)
|
||||
self.callbacks.extend((LoadLoras(), SaveLoras()))
|
||||
|
||||
def load_dataset(self) -> Dataset[TextEmbeddingLatentsBatch]:
|
||||
return TriggerPhraseDataset(trainer=self)
|
||||
|
||||
|
||||
class LoadLoras(Callback[LoraLatentDiffusionTrainer]):
|
||||
def on_train_begin(self, trainer: LoraLatentDiffusionTrainer) -> None:
|
||||
lora_config = trainer.config.lora
|
||||
for target in lora_config.unet_targets:
|
||||
lora_config.apply_loras_to_target(module=trainer.unet, target=target)
|
||||
for target in lora_config.text_encoder_targets:
|
||||
lora_config.apply_loras_to_target(module=trainer.text_encoder, target=target)
|
||||
for target in lora_config.lda_targets:
|
||||
lora_config.apply_loras_to_target(module=trainer.lda, target=target)
|
||||
|
||||
|
||||
class SaveLoras(Callback[LoraLatentDiffusionTrainer]):
|
||||
def on_checkpoint_save(self, trainer: LoraLatentDiffusionTrainer) -> None:
|
||||
lora_config = trainer.config.lora
|
||||
|
||||
def get_weight(linear: fl.Linear) -> Tensor:
|
||||
assert linear.bias is None
|
||||
return linear.state_dict()["weight"]
|
||||
|
||||
def build_loras_safetensors(module: fl.Chain, key_prefix: str) -> dict[str, Tensor]:
|
||||
weights: list[Tensor] = []
|
||||
for lora in module.layers(layer_type=Lora):
|
||||
linears = list(lora.layers(fl.Linear))
|
||||
assert len(linears) == 2
|
||||
# See `load_lora_weights` in refiners.adapters.lora
|
||||
weights.extend((get_weight(linears[1]), get_weight(linears[0]))) # aka (up_weight, down_weight)
|
||||
return {f"{key_prefix}{i:03d}": w for i, w in enumerate(weights)}
|
||||
|
||||
tensors: dict[str, Tensor] = {}
|
||||
metadata: dict[str, str] = {}
|
||||
|
||||
if lora_config.unet_targets:
|
||||
tensors |= build_loras_safetensors(trainer.unet, key_prefix="unet.")
|
||||
metadata |= {"unet_targets": ",".join(lora_config.unet_targets)}
|
||||
|
||||
if lora_config.text_encoder_targets:
|
||||
tensors |= build_loras_safetensors(trainer.text_encoder, key_prefix="text_encoder.")
|
||||
metadata |= {"text_encoder_targets": ",".join(lora_config.text_encoder_targets)}
|
||||
|
||||
if lora_config.lda_targets:
|
||||
tensors |= build_loras_safetensors(trainer.lda, key_prefix="lda.")
|
||||
metadata |= {"lda_targets": ",".join(lora_config.lda_targets)}
|
||||
|
||||
save_to_safetensors(
|
||||
path=trainer.ensure_checkpoints_save_folder / f"step{trainer.clock.step}.safetensors",
|
||||
tensors=tensors,
|
||||
metadata=metadata,
|
||||
)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
import sys
|
||||
|
||||
config_path = sys.argv[1]
|
||||
config = LoraLatentDiffusionConfig.load_from_toml(
|
||||
toml_path=config_path,
|
||||
)
|
||||
trainer = LoraLatentDiffusionTrainer(config=config)
|
||||
trainer.train()
|
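For orientation, here is a minimal sketch (not part of this commit) of how a checkpoint written by `SaveLoras` can be inspected later. The file name is hypothetical; keys alternate up/down weights per `Lora` layer, which is the order `load_lora_weights` in `refiners.adapters.lora` expects.

```python
# Minimal sketch (not part of this commit); "step1000.safetensors" is a hypothetical name.
from safetensors import safe_open

with safe_open("step1000.safetensors", framework="pt") as checkpoint:
    print(checkpoint.metadata())  # e.g. {"unet_targets": "..."} as written by SaveLoras
    unet_keys = sorted(key for key in checkpoint.keys() if key.startswith("unet."))
    # Keys alternate per Lora layer: unet.000 is an up weight, unet.001 the matching down weight.
    up_weight = checkpoint.get_tensor(unet_keys[0])
    down_weight = checkpoint.get_tensor(unet_keys[1])
```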
11
scripts/training/finetune-ldm.py
Normal file
|
@@ -0,0 +1,11 @@
|
|||
from refiners.training_utils.latent_diffusion import FinetuneLatentDiffusionConfig, LatentDiffusionTrainer
|
||||
|
||||
if __name__ == "__main__":
|
||||
import sys
|
||||
|
||||
config_path = sys.argv[1]
|
||||
config = FinetuneLatentDiffusionConfig.load_from_toml(
|
||||
toml_path=config_path,
|
||||
)
|
||||
trainer = LatentDiffusionTrainer(config=config)
|
||||
trainer.train()
|
0
src/refiners/__init__.py
Normal file
0
src/refiners/adapters/__init__.py
Normal file
66
src/refiners/adapters/adapter.py
Normal file
|
@@ -0,0 +1,66 @@
|
|||
import contextlib
|
||||
import refiners.fluxion.layers as fl
|
||||
from typing import Any, Generic, TypeVar, Iterator
|
||||
|
||||
|
||||
T = TypeVar("T", bound=fl.Module)
|
||||
TAdapter = TypeVar("TAdapter", bound="Adapter[fl.Module]")
|
||||
|
||||
|
||||
class Adapter(Generic[T]):
|
||||
# we store _target in a one-element list to avoid PyTorch thinking it is a submodule
|
||||
_target: "list[T]"
|
||||
|
||||
def __init_subclass__(cls, **kwargs: Any) -> None:
|
||||
super().__init_subclass__(**kwargs)
|
||||
assert issubclass(cls, fl.Chain), f"Adapter {cls.__name__} must be a Chain"
|
||||
|
||||
@property
|
||||
def target(self) -> T:
|
||||
return self._target[0]
|
||||
|
||||
@contextlib.contextmanager
|
||||
def setup_adapter(self, target: T) -> Iterator[None]:
|
||||
assert isinstance(self, fl.Chain)
|
||||
assert (not hasattr(self, "_modules")) or (
|
||||
len(self) == 0
|
||||
), "Call the Chain constructor in the setup_adapter context."
|
||||
self._target = [target]
|
||||
|
||||
if not isinstance(self.target, fl.ContextModule):
|
||||
yield
|
||||
return
|
||||
|
||||
_old_can_refresh_parent = target._can_refresh_parent
|
||||
target._can_refresh_parent = False
|
||||
yield
|
||||
target._can_refresh_parent = _old_can_refresh_parent
|
||||
|
||||
def inject(self, parent: fl.Chain | None = None) -> None:
|
||||
assert isinstance(self, fl.Chain)
|
||||
|
||||
if parent is None:
|
||||
if isinstance(self.target, fl.ContextModule):
|
||||
parent = self.target.parent
|
||||
else:
|
||||
raise ValueError(f"parent of {self.target} is mandatory")
|
||||
assert isinstance(parent, fl.Chain), f"{self.target} has invalid parent {parent}"
|
||||
if self.target not in iter(parent):
|
||||
raise ValueError(f"{self.target} is not in {parent}")
|
||||
|
||||
parent.replace(
|
||||
old_module=self.target,
|
||||
new_module=self,
|
||||
old_module_parent=self.find_parent(self.target),
|
||||
)
|
||||
|
||||
def eject(self) -> None:
|
||||
assert isinstance(self, fl.Chain)
|
||||
self.ensure_parent.replace(old_module=self, new_module=self.target)
|
||||
|
||||
def _pre_structural_copy(self) -> None:
|
||||
if isinstance(self.target, fl.Chain):
|
||||
raise RuntimeError("Chain adapters typically cannot be copied, eject them first.")
|
||||
|
||||
def _post_structural_copy(self: TAdapter, source: TAdapter) -> None:
|
||||
self._target = [source.target]
|
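A quick illustration of the inject/eject lifecycle above, using the `LoraAdapter` defined in `src/refiners/adapters/lora.py` just below. This is a sketch for orientation, not part of the commit.

```python
import refiners.fluxion.layers as fl
from refiners.adapters.lora import LoraAdapter  # defined in the next file

model = fl.Chain(fl.Linear(in_features=8, out_features=8))
adapter = LoraAdapter(target=model.Linear, rank=4)

adapter.inject(model)                 # the Linear is replaced by Sum(Linear, Lora) inside `model`
assert model.LoraAdapter is adapter
adapter.eject()                       # the original Linear is put back in place
assert isinstance(model.Linear, fl.Linear)
```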
88
src/refiners/adapters/lora.py
Normal file
|
@@ -0,0 +1,88 @@
|
|||
import refiners.fluxion.layers as fl
|
||||
from refiners.adapters.adapter import Adapter
|
||||
|
||||
from torch.nn.init import zeros_, normal_
|
||||
from torch import Tensor, device as Device, dtype as DType
|
||||
|
||||
|
||||
class Lora(fl.Chain):
|
||||
structural_attrs = ["in_features", "out_features", "rank", "scale"]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
in_features: int,
|
||||
out_features: int,
|
||||
rank: int = 16,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
) -> None:
|
||||
self.in_features = in_features
|
||||
self.out_features = out_features
|
||||
self.rank = rank
|
||||
self.scale: float = 1.0
|
||||
|
||||
super().__init__(
|
||||
fl.Linear(in_features=in_features, out_features=rank, bias=False, device=device, dtype=dtype),
|
||||
fl.Linear(in_features=rank, out_features=out_features, bias=False, device=device, dtype=dtype),
|
||||
fl.Lambda(func=self.scale_outputs),
|
||||
)
|
||||
|
||||
normal_(tensor=self.Linear_1.weight, std=1 / self.rank)
|
||||
zeros_(tensor=self.Linear_2.weight)
|
||||
|
||||
def scale_outputs(self, x: Tensor) -> Tensor:
|
||||
return x * self.scale
|
||||
|
||||
def set_scale(self, scale: float) -> None:
|
||||
self.scale = scale
|
||||
|
||||
def load_weights(self, down_weight: Tensor, up_weight: Tensor) -> None:
|
||||
self.Linear_1.weight = down_weight
|
||||
self.Linear_2.weight = up_weight
|
||||
|
||||
|
||||
class LoraAdapter(fl.Sum, Adapter[fl.Linear]):
|
||||
structural_attrs = ["in_features", "out_features", "rank", "scale"]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
target: fl.Linear,
|
||||
rank: int = 16,
|
||||
scale: float = 1.0,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
) -> None:
|
||||
self.in_features = target.in_features
|
||||
self.out_features = target.out_features
|
||||
self.rank = rank
|
||||
self.scale = scale
|
||||
with self.setup_adapter(target):
|
||||
super().__init__(
|
||||
target,
|
||||
Lora(
|
||||
in_features=target.in_features,
|
||||
out_features=target.out_features,
|
||||
rank=rank,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
)
|
||||
self.Lora.set_scale(scale=scale)
|
||||
|
||||
def add_lora(self, lora: Lora) -> None:
|
||||
self.append(module=lora)
|
||||
|
||||
def load_lora_weights(self, up_weight: Tensor, down_weight: Tensor, index: int = 0) -> None:
|
||||
self[index + 1].load_weights(up_weight=up_weight, down_weight=down_weight)
|
||||
|
||||
|
||||
def load_lora_weights(model: fl.Chain, weights: list[Tensor]) -> None:
|
||||
assert len(weights) % 2 == 0, "Number of weights must be even"
|
||||
assert (
|
||||
len(list(model.layers(layer_type=Lora))) == len(weights) // 2
|
||||
), "Number of Lora layers must match number of weights"
|
||||
for i, lora in enumerate(iterable=model.layers(layer_type=Lora)):
|
||||
assert (
|
||||
lora.rank == weights[i * 2].shape[1]
|
||||
), f"Rank of Lora layer {lora.rank} must match shape of weights {weights[i*2].shape[1]}"
|
||||
lora.load_weights(up_weight=weights[i * 2], down_weight=weights[i * 2 + 1])
|
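Because the up projection (`Linear_2`) is zero-initialized, a freshly built LoRA is a no-op: the adapted layer computes exactly what its target did until the low-rank weights are trained. A small sanity-check sketch:

```python
import torch
import refiners.fluxion.layers as fl
from refiners.adapters.lora import LoraAdapter

linear = fl.Linear(in_features=16, out_features=16)
adapter = LoraAdapter(target=linear, rank=4)

x = torch.randn(2, 16)
assert torch.allclose(adapter(x), linear(x))  # the Lora branch contributes zeros at init
```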
70
src/refiners/adapters/range_adapter.py
Normal file
|
@@ -0,0 +1,70 @@
|
|||
import math
|
||||
from torch import Tensor, arange, float32, exp, sin, cat, cos, device as Device, dtype as DType
|
||||
from jaxtyping import Float, Int
|
||||
|
||||
from refiners.adapters.adapter import Adapter
|
||||
import refiners.fluxion.layers as fl
|
||||
|
||||
|
||||
def compute_sinusoidal_embedding(
|
||||
x: Int[Tensor, "*batch 1"],
|
||||
embedding_dim: int,
|
||||
) -> Float[Tensor, "*batch 1 embedding_dim"]:
|
||||
half_dim = embedding_dim // 2
|
||||
# Note: it is important that this computation is done in float32.
|
||||
# The result can be cast to lower precision later if necessary.
|
||||
exponent = -math.log(10000) * arange(start=0, end=half_dim, dtype=float32, device=x.device)
|
||||
exponent /= half_dim
|
||||
embedding = x.unsqueeze(1).float() * exp(exponent).unsqueeze(0)
|
||||
embedding = cat([cos(embedding), sin(embedding)], dim=-1)
|
||||
return embedding
|
||||
|
||||
|
||||
class RangeEncoder(fl.Chain):
|
||||
structural_attrs = ["sinuosidal_embedding_dim", "embedding_dim"]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
sinuosidal_embedding_dim: int,
|
||||
embedding_dim: int,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
) -> None:
|
||||
self.sinuosidal_embedding_dim = sinuosidal_embedding_dim
|
||||
self.embedding_dim = embedding_dim
|
||||
super().__init__(
|
||||
fl.Lambda(self.compute_sinuosoidal_embedding),
|
||||
fl.Linear(in_features=sinuosidal_embedding_dim, out_features=embedding_dim, device=device, dtype=dtype),
|
||||
fl.SiLU(),
|
||||
fl.Linear(in_features=embedding_dim, out_features=embedding_dim, device=device, dtype=dtype),
|
||||
)
|
||||
|
||||
def compute_sinuosoidal_embedding(self, x: Int[Tensor, "*batch 1"]) -> Float[Tensor, "*batch 1 embedding_dim"]:
|
||||
return compute_sinusoidal_embedding(x, embedding_dim=self.sinuosidal_embedding_dim).to(self.dtype)
|
||||
|
||||
|
||||
class RangeAdapter2d(fl.Sum, Adapter[fl.Conv2d]):
|
||||
structural_attrs = ["channels", "embedding_dim", "context_key"]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
target: fl.Conv2d,
|
||||
channels: int,
|
||||
embedding_dim: int,
|
||||
context_key: str,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
) -> None:
|
||||
self.channels = channels
|
||||
self.embedding_dim = embedding_dim
|
||||
self.context_key = context_key
|
||||
with self.setup_adapter(target):
|
||||
super().__init__(
|
||||
target,
|
||||
fl.Chain(
|
||||
fl.UseContext("range_adapter", context_key),
|
||||
fl.SiLU(),
|
||||
fl.Linear(in_features=embedding_dim, out_features=channels, device=device, dtype=dtype),
|
||||
fl.View(-1, channels, 1, 1),
|
||||
),
|
||||
)
|
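`compute_sinusoidal_embedding` is the usual diffusion timestep embedding: half cosine and half sine terms with geometrically spaced frequencies, computed in float32. A short usage sketch (the 320 dimension is only an example):

```python
import torch
from refiners.adapters.range_adapter import compute_sinusoidal_embedding

steps = torch.tensor([[0], [250], [999]])             # one timestep per batch element
embedding = compute_sinusoidal_embedding(steps, embedding_dim=320)
print(embedding.shape)                                # torch.Size([3, 1, 320])
```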
3
src/refiners/fluxion/__init__.py
Normal file
|
@@ -0,0 +1,3 @@
|
|||
from refiners.fluxion.utils import save_to_safetensors, load_from_safetensors, norm, manual_seed, pad
|
||||
|
||||
__all__ = ["norm", "manual_seed", "save_to_safetensors", "load_from_safetensors", "pad"]
|
52
src/refiners/fluxion/context.py
Normal file
|
@@ -0,0 +1,52 @@
|
|||
from typing import Any
|
||||
from torch import Tensor
|
||||
|
||||
Context = dict[str, Any]
|
||||
Contexts = dict[str, Context]
|
||||
|
||||
|
||||
class ContextProvider:
|
||||
def __init__(self) -> None:
|
||||
self.contexts: Contexts = {}
|
||||
|
||||
def set_context(self, key: str, value: Context) -> None:
|
||||
self.contexts[key] = value
|
||||
|
||||
def get_context(self, key: str) -> Any:
|
||||
return self.contexts.get(key)
|
||||
|
||||
def update_contexts(self, new_contexts: Contexts) -> None:
|
||||
for key, value in new_contexts.items():
|
||||
if key not in self.contexts:
|
||||
self.contexts[key] = value
|
||||
else:
|
||||
self.contexts[key].update(value)
|
||||
|
||||
@staticmethod
|
||||
def create(contexts: Contexts) -> "ContextProvider":
|
||||
provider = ContextProvider()
|
||||
provider.update_contexts(contexts)
|
||||
return provider
|
||||
|
||||
def __add__(self, other: "ContextProvider") -> "ContextProvider":
|
||||
self.contexts.update(other.contexts)
|
||||
return self
|
||||
|
||||
def __lshift__(self, other: "ContextProvider") -> "ContextProvider":
|
||||
other.contexts.update(self.contexts)
|
||||
return other
|
||||
|
||||
def __bool__(self) -> bool:
|
||||
return bool(self.contexts)
|
||||
|
||||
def _get_repr_for_value(self, value: Any) -> str:
|
||||
if isinstance(value, Tensor):
|
||||
return f"Tensor(shape={value.shape}, dtype={value.dtype}, device={value.device})"
|
||||
return repr(value)
|
||||
|
||||
def _get_repr_for_dict(self, context_dict: Context) -> dict[str, str]:
|
||||
return {key: self._get_repr_for_value(value) for key, value in context_dict.items()}
|
||||
|
||||
def __repr__(self) -> str:
|
||||
contexts_repr = {key: self._get_repr_for_dict(value) for key, value in self.contexts.items()}
|
||||
return f"{self.__class__.__name__}(contexts={contexts_repr})"
|
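In short, a `ContextProvider` is a two-level dict of named contexts shared across a Chain tree; `update_contexts` merges per-context instead of replacing wholesale. A tiny sketch:

```python
from refiners.fluxion.context import ContextProvider

provider = ContextProvider()
provider.set_context("sampling", {"shapes": []})
provider.update_contexts({"sampling": {"shapes": [(64, 64)]}, "diffusion": {"timestep": 0}})
print(provider.get_context("sampling"))   # {'shapes': [(64, 64)]}
print(provider.get_context("diffusion"))  # {'timestep': 0}
```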
82
src/refiners/fluxion/layers/__init__.py
Normal file
|
@@ -0,0 +1,82 @@
|
|||
from refiners.fluxion.layers.activations import GLU, SiLU, ReLU, ApproximateGeLU, GeLU
|
||||
from refiners.fluxion.layers.norm import LayerNorm, GroupNorm, LayerNorm2d
|
||||
from refiners.fluxion.layers.attentions import Attention, SelfAttention, SelfAttention2d
|
||||
from refiners.fluxion.layers.basics import (
|
||||
Identity,
|
||||
View,
|
||||
Flatten,
|
||||
Unflatten,
|
||||
Transpose,
|
||||
Permute,
|
||||
Reshape,
|
||||
Squeeze,
|
||||
Unsqueeze,
|
||||
Slicing,
|
||||
Parameter,
|
||||
Buffer,
|
||||
)
|
||||
from refiners.fluxion.layers.chain import (
|
||||
Lambda,
|
||||
Sum,
|
||||
Residual,
|
||||
Return,
|
||||
Chain,
|
||||
UseContext,
|
||||
SetContext,
|
||||
Parallel,
|
||||
Passthrough,
|
||||
Breakpoint,
|
||||
Concatenate,
|
||||
)
|
||||
from refiners.fluxion.layers.conv import Conv2d
|
||||
from refiners.fluxion.layers.linear import Linear, MultiLinear
|
||||
from refiners.fluxion.layers.module import Module, WeightedModule, ContextModule
|
||||
from refiners.fluxion.layers.sampling import Downsample, Upsample, Interpolate
|
||||
from refiners.fluxion.layers.embedding import Embedding
|
||||
|
||||
__all__ = [
|
||||
"Embedding",
|
||||
"LayerNorm",
|
||||
"GroupNorm",
|
||||
"LayerNorm2d",
|
||||
"GeLU",
|
||||
"GLU",
|
||||
"SiLU",
|
||||
"ReLU",
|
||||
"ApproximateGeLU",
|
||||
"Attention",
|
||||
"SelfAttention",
|
||||
"SelfAttention2d",
|
||||
"Identity",
|
||||
"View",
|
||||
"Flatten",
|
||||
"Unflatten",
|
||||
"Transpose",
|
||||
"Permute",
|
||||
"Squeeze",
|
||||
"Unsqueeze",
|
||||
"Reshape",
|
||||
"Slicing",
|
||||
"Parameter",
|
||||
"Buffer",
|
||||
"Lambda",
|
||||
"Return",
|
||||
"Sum",
|
||||
"Residual",
|
||||
"Chain",
|
||||
"UseContext",
|
||||
"SetContext",
|
||||
"Parallel",
|
||||
"Passthrough",
|
||||
"Breakpoint",
|
||||
"Concatenate",
|
||||
"Conv2d",
|
||||
"Linear",
|
||||
"MultiLinear",
|
||||
"Downsample",
|
||||
"Upsample",
|
||||
"Module",
|
||||
"WeightedModule",
|
||||
"ContextModule",
|
||||
"Interpolate",
|
||||
]
|
66
src/refiners/fluxion/layers/activations.py
Normal file
|
@@ -0,0 +1,66 @@
|
|||
from refiners.fluxion.layers.module import Module
|
||||
from torch.nn.functional import silu
|
||||
from torch import Tensor, sigmoid
|
||||
from torch.nn.functional import gelu # type: ignore
|
||||
|
||||
|
||||
class Activation(Module):
|
||||
def __init__(self) -> None:
|
||||
super().__init__()
|
||||
|
||||
|
||||
class SiLU(Activation):
|
||||
def __init__(self) -> None:
|
||||
super().__init__()
|
||||
|
||||
def forward(self, x: Tensor) -> Tensor:
|
||||
return silu(x) # type: ignore
|
||||
|
||||
|
||||
class ReLU(Activation):
|
||||
def __init__(self) -> None:
|
||||
super().__init__()
|
||||
|
||||
def forward(self, x: Tensor) -> Tensor:
|
||||
return x.relu()
|
||||
|
||||
|
||||
class GeLU(Activation):
|
||||
def __init__(self) -> None:
|
||||
super().__init__()
|
||||
|
||||
def forward(self, x: Tensor) -> Tensor:
|
||||
return gelu(x) # type: ignore
|
||||
|
||||
|
||||
class ApproximateGeLU(Activation):
|
||||
"""
|
||||
The approximate form of Gaussian Error Linear Unit (GELU)
|
||||
For more details, see section 2: https://arxiv.org/abs/1606.08415
|
||||
"""
|
||||
|
||||
def __init__(self) -> None:
|
||||
super().__init__()
|
||||
|
||||
def forward(self, x: Tensor) -> Tensor:
|
||||
return x * sigmoid(1.702 * x)
|
||||
|
||||
|
||||
class GLU(Activation):
|
||||
"""
|
||||
Gated Linear Unit activation layer.
|
||||
|
||||
See https://arxiv.org/abs/2002.05202v1 for details.
|
||||
"""
|
||||
|
||||
def __init__(self, activation: Activation) -> None:
|
||||
super().__init__()
|
||||
self.activation = activation
|
||||
|
||||
def __repr__(self):
|
||||
return f"{self.__class__.__name__}(activation={self.activation})"
|
||||
|
||||
def forward(self, x: Tensor) -> Tensor:
|
||||
assert x.shape[-1] % 2 == 0, "Non-batch input dimension must be divisible by 2"
|
||||
output, gate = x.chunk(2, dim=-1)
|
||||
return output * self.activation(gate)
|
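GLU splits the last dimension in two and gates one half with the activation of the other, so the output has half the input's last dimension. Sketch:

```python
import torch
from refiners.fluxion.layers import GLU, GeLU

glu = GLU(activation=GeLU())
x = torch.randn(2, 10, 8)
print(glu(x).shape)  # torch.Size([2, 10, 4])
```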
189
src/refiners/fluxion/layers/attentions.py
Normal file
|
@@ -0,0 +1,189 @@
|
|||
from jaxtyping import Float
|
||||
from torch.nn.functional import scaled_dot_product_attention as _scaled_dot_product_attention # type: ignore
|
||||
from torch import Tensor, device as Device, dtype as DType
|
||||
|
||||
from refiners.fluxion.layers.linear import Linear
|
||||
from refiners.fluxion.layers.module import Module
|
||||
from refiners.fluxion.layers.chain import Chain, Distribute, Parallel, Lambda
|
||||
from refiners.fluxion.layers.basics import Identity
|
||||
from refiners.fluxion.context import Contexts
|
||||
|
||||
|
||||
def scaled_dot_product_attention(
|
||||
query: Float[Tensor, "batch source_sequence_length dim"],
|
||||
key: Float[Tensor, "batch target_sequence_length dim"],
|
||||
value: Float[Tensor, "batch target_sequence_length dim"],
|
||||
is_causal: bool = False,
|
||||
) -> Float[Tensor, "batch source_sequence_length dim"]:
|
||||
return _scaled_dot_product_attention(query, key, value, is_causal=is_causal) # type: ignore
|
||||
|
||||
|
||||
class ScaledDotProductAttention(Module):
|
||||
def __init__(self, num_heads: int = 1, is_causal: bool | None = None) -> None:
|
||||
super().__init__()
|
||||
self.num_heads = num_heads
|
||||
self.is_causal = is_causal
|
||||
|
||||
def forward(
|
||||
self,
|
||||
query: Float[Tensor, "batch num_queries embedding_dim"],
|
||||
key: Float[Tensor, "batch num_keys embedding_dim"],
|
||||
value: Float[Tensor, "batch num_values embedding_dim"],
|
||||
is_causal: bool | None = None,
|
||||
) -> Float[Tensor, "batch num_queries dim"]:
|
||||
return self.merge_multi_head(
|
||||
scaled_dot_product_attention(
|
||||
query=self.split_to_multi_head(query),
|
||||
key=self.split_to_multi_head(key),
|
||||
value=self.split_to_multi_head(value),
|
||||
is_causal=(
|
||||
is_causal if is_causal is not None else (self.is_causal if self.is_causal is not None else False)
|
||||
),
|
||||
)
|
||||
)
|
||||
|
||||
def split_to_multi_head(
|
||||
self, x: Float[Tensor, "batch_size sequence_length embedding_dim"]
|
||||
) -> Float[Tensor, "batch_size num_heads sequence_length (embedding_dim//num_heads)"]:
|
||||
assert (
|
||||
len(x.shape) == 3
|
||||
), f"Expected tensor with shape (batch_size sequence_length embedding_dim), got {x.shape}"
|
||||
assert (
|
||||
x.shape[-1] % self.num_heads == 0
|
||||
), f"Embedding dim (x.shape[-1]={x.shape[-1]}) must be divisible by num heads"
|
||||
return x.reshape(x.shape[0], x.shape[1], self.num_heads, x.shape[-1] // self.num_heads).transpose(1, 2)
|
||||
|
||||
def merge_multi_head(
|
||||
self, x: Float[Tensor, "batch_size num_heads sequence_length heads_dim"]
|
||||
) -> Float[Tensor, "batch_size sequence_length heads_dim * num_heads"]:
|
||||
return x.transpose(1, 2).reshape(x.shape[0], x.shape[2], self.num_heads * x.shape[-1])
|
||||
|
||||
|
||||
class Attention(Chain):
|
||||
structural_attrs = [
|
||||
"embedding_dim",
|
||||
"num_heads",
|
||||
"heads_dim",
|
||||
"key_embedding_dim",
|
||||
"value_embedding_dim",
|
||||
"use_bias",
|
||||
"is_causal",
|
||||
]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
embedding_dim: int,
|
||||
num_heads: int = 1,
|
||||
key_embedding_dim: int | None = None,
|
||||
value_embedding_dim: int | None = None,
|
||||
use_bias: bool = True,
|
||||
is_causal: bool | None = None,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
) -> None:
|
||||
assert (
|
||||
embedding_dim % num_heads == 0
|
||||
), f"embedding_dim {embedding_dim} must be divisible by num_heads {num_heads}"
|
||||
self.embedding_dim = embedding_dim
|
||||
self.num_heads = num_heads
|
||||
self.heads_dim = embedding_dim // num_heads
|
||||
self.key_embedding_dim = key_embedding_dim or embedding_dim
|
||||
self.value_embedding_dim = value_embedding_dim or embedding_dim
|
||||
self.use_bias = use_bias
|
||||
self.is_causal = is_causal
|
||||
super().__init__(
|
||||
Distribute(
|
||||
Linear(
|
||||
in_features=self.embedding_dim,
|
||||
out_features=self.embedding_dim,
|
||||
bias=self.use_bias,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
Linear(
|
||||
in_features=self.key_embedding_dim,
|
||||
out_features=self.embedding_dim,
|
||||
bias=self.use_bias,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
Linear(
|
||||
in_features=self.value_embedding_dim,
|
||||
out_features=self.embedding_dim,
|
||||
bias=self.use_bias,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
),
|
||||
ScaledDotProductAttention(num_heads=num_heads, is_causal=is_causal),
|
||||
Linear(
|
||||
in_features=self.embedding_dim,
|
||||
out_features=self.embedding_dim,
|
||||
bias=True,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
)
|
||||
|
||||
|
||||
class SelfAttention(Attention):
|
||||
def __init__(
|
||||
self,
|
||||
embedding_dim: int,
|
||||
num_heads: int = 1,
|
||||
use_bias: bool = True,
|
||||
is_causal: bool | None = None,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
) -> None:
|
||||
super().__init__(
|
||||
embedding_dim=embedding_dim,
|
||||
num_heads=num_heads,
|
||||
use_bias=use_bias,
|
||||
is_causal=is_causal,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
)
|
||||
self.insert(0, Parallel(Identity(), Identity(), Identity()))
|
||||
|
||||
|
||||
class SelfAttention2d(SelfAttention):
|
||||
structural_attrs = ["channels"]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
channels: int,
|
||||
num_heads: int = 1,
|
||||
use_bias: bool = True,
|
||||
is_causal: bool | None = None,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
) -> None:
|
||||
assert channels % num_heads == 0, f"channels {channels} must be divisible by num_heads {num_heads}"
|
||||
self.channels = channels
|
||||
super().__init__(
|
||||
embedding_dim=channels,
|
||||
num_heads=num_heads,
|
||||
use_bias=use_bias,
|
||||
is_causal=is_causal,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
)
|
||||
self.insert(0, Lambda(self.tensor_2d_to_sequence))
|
||||
self.append(Lambda(self.sequence_to_tensor_2d))
|
||||
|
||||
def init_context(self) -> Contexts:
|
||||
return {"reshape": {"height": None, "width": None}}
|
||||
|
||||
def tensor_2d_to_sequence(
|
||||
self, x: Float[Tensor, "batch channels height width"]
|
||||
) -> Float[Tensor, "batch height*width channels"]:
|
||||
height, width = x.shape[-2:]
|
||||
self.set_context(context="reshape", value={"height": height, "width": width})
|
||||
return x.reshape(x.shape[0], x.shape[1], height * width).transpose(1, 2)
|
||||
|
||||
def sequence_to_tensor_2d(
|
||||
self, x: Float[Tensor, "batch sequence_length channels"]
|
||||
) -> Float[Tensor, "batch channels height width"]:
|
||||
height, width = self.use_context("reshape").values()
|
||||
return x.transpose(1, 2).reshape(x.shape[0], x.shape[2], height, width)
|
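The attention stack reads bottom-up: project (query, key, value) with a Distribute of Linears, run multi-head scaled dot-product attention, then project the merged heads. SelfAttention2d additionally flattens an NCHW feature map into a sequence and restores the spatial shape through the "reshape" context. A usage sketch:

```python
import torch
import refiners.fluxion.layers as fl

attention = fl.Attention(embedding_dim=64, num_heads=8)
query = key = value = torch.randn(2, 77, 64)
print(attention(query, key, value).shape)      # torch.Size([2, 77, 64])

spatial_attention = fl.SelfAttention2d(channels=64, num_heads=8)
feature_map = torch.randn(2, 64, 32, 32)
print(spatial_attention(feature_map).shape)    # torch.Size([2, 64, 32, 32])
```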
183
src/refiners/fluxion/layers/basics.py
Normal file
|
@@ -0,0 +1,183 @@
|
|||
from refiners.fluxion.layers.module import Module, WeightedModule
|
||||
from torch import randn, Tensor, Size, device as Device, dtype as DType
|
||||
from torch.nn import Parameter as TorchParameter
|
||||
|
||||
|
||||
class Identity(Module):
|
||||
def __init__(self) -> None:
|
||||
super().__init__()
|
||||
|
||||
def forward(self, x: Tensor) -> Tensor:
|
||||
return x
|
||||
|
||||
|
||||
class View(Module):
|
||||
def __init__(self, *shape: int) -> None:
|
||||
super().__init__()
|
||||
self.shape = shape
|
||||
|
||||
def forward(self, x: Tensor) -> Tensor:
|
||||
return x.view(*self.shape)
|
||||
|
||||
def __repr__(self):
|
||||
shape_repr = ", ".join([repr(s) for s in self.shape])
|
||||
return f"{self.__class__.__name__}({shape_repr})"
|
||||
|
||||
|
||||
class Flatten(Module):
|
||||
def __init__(self, start_dim: int = 0, end_dim: int = -1) -> None:
|
||||
super().__init__()
|
||||
self.start_dim = start_dim
|
||||
self.end_dim = end_dim
|
||||
|
||||
def forward(self, x: Tensor) -> Tensor:
|
||||
return x.flatten(self.start_dim, self.end_dim)
|
||||
|
||||
def __repr__(self):
|
||||
return f"{self.__class__.__name__}(start_dim={repr(self.start_dim)}, end_dim={repr(self.end_dim)})"
|
||||
|
||||
|
||||
class Unflatten(Module):
|
||||
def __init__(self, dim: int) -> None:
|
||||
super().__init__()
|
||||
self.dim = dim
|
||||
|
||||
def forward(self, x: Tensor, sizes: Size) -> Tensor:
|
||||
return x.unflatten(self.dim, sizes) # type: ignore
|
||||
|
||||
def __repr__(self):
|
||||
return f"{self.__class__.__name__}(dim={repr(self.dim)})"
|
||||
|
||||
|
||||
class Reshape(Module):
|
||||
"""
|
||||
Reshape the input tensor to the given shape. The shape must be compatible with the input tensor shape. The batch
|
||||
dimension is preserved.
|
||||
"""
|
||||
|
||||
def __init__(self, *shape: int) -> None:
|
||||
super().__init__()
|
||||
self.shape = shape
|
||||
|
||||
def forward(self, x: Tensor) -> Tensor:
|
||||
return x.reshape(x.shape[0], *self.shape)
|
||||
|
||||
def __repr__(self):
|
||||
shape_repr = ", ".join([repr(s) for s in self.shape])
|
||||
return f"{self.__class__.__name__}({shape_repr})"
|
||||
|
||||
|
||||
class Transpose(Module):
|
||||
def __init__(self, dim0: int, dim1: int) -> None:
|
||||
super().__init__()
|
||||
self.dim0 = dim0
|
||||
self.dim1 = dim1
|
||||
|
||||
def forward(self, x: Tensor) -> Tensor:
|
||||
return x.transpose(self.dim0, self.dim1)
|
||||
|
||||
def __repr__(self):
|
||||
return f"{self.__class__.__name__}(dim0={repr(self.dim0)}, dim1={repr(self.dim1)})"
|
||||
|
||||
|
||||
class Permute(Module):
|
||||
def __init__(self, *dims: int) -> None:
|
||||
super().__init__()
|
||||
self.dims = dims
|
||||
|
||||
def forward(self, x: Tensor) -> Tensor:
|
||||
return x.permute(*self.dims)
|
||||
|
||||
def __repr__(self):
|
||||
dims_repr = ", ".join([repr(d) for d in self.dims])
|
||||
return f"{self.__class__.__name__}({dims_repr})"
|
||||
|
||||
|
||||
class Slicing(Module):
|
||||
def __init__(self, dim: int, start: int, length: int) -> None:
|
||||
super().__init__()
|
||||
self.dim = dim
|
||||
self.start = start
|
||||
self.length = length
|
||||
|
||||
def forward(self, x: Tensor) -> Tensor:
|
||||
return x.narrow(self.dim, self.start, self.length)
|
||||
|
||||
def __repr__(self):
|
||||
return f"{self.__class__.__name__}(dim={repr(self.dim)}, start={repr(self.start)}, length={repr(self.length)})"
|
||||
|
||||
|
||||
class Squeeze(Module):
|
||||
def __init__(self, dim: int) -> None:
|
||||
super().__init__()
|
||||
self.dim = dim
|
||||
|
||||
def forward(self, x: Tensor) -> Tensor:
|
||||
return x.squeeze(self.dim)
|
||||
|
||||
def __repr__(self):
|
||||
return f"{self.__class__.__name__}(dim={repr(self.dim)})"
|
||||
|
||||
|
||||
class Unsqueeze(Module):
|
||||
def __init__(self, dim: int) -> None:
|
||||
super().__init__()
|
||||
self.dim = dim
|
||||
|
||||
def forward(self, x: Tensor) -> Tensor:
|
||||
return x.unsqueeze(self.dim)
|
||||
|
||||
def __repr__(self):
|
||||
return f"{self.__class__.__name__}(dim={repr(self.dim)})"
|
||||
|
||||
|
||||
class Parameter(WeightedModule):
|
||||
"""
|
||||
A layer that wraps a tensor as a parameter. This is useful to create a parameter that is not a weight or a bias.
|
||||
"""
|
||||
|
||||
def __init__(self, *dims: int, device: Device | str | None = None, dtype: DType | None = None) -> None:
|
||||
super().__init__()
|
||||
self.register_parameter("parameter", TorchParameter(randn(*dims, device=device, dtype=dtype)))
|
||||
|
||||
@property
|
||||
def device(self) -> Device:
|
||||
return self.parameter.device
|
||||
|
||||
@property
|
||||
def dtype(self) -> DType:
|
||||
return self.parameter.dtype
|
||||
|
||||
def forward(self, _: Tensor) -> Tensor:
|
||||
return self.parameter
|
||||
|
||||
def __repr__(self):
|
||||
dims_repr = ", ".join([repr(d) for d in list(self.parameter.shape)])
|
||||
return f"{self.__class__.__name__}({dims_repr}, device={repr(self.device)})"
|
||||
|
||||
|
||||
class Buffer(WeightedModule):
|
||||
"""
|
||||
A layer that wraps a tensor as a buffer. This is useful to create a buffer that is not a weight or a bias.
|
||||
|
||||
Buffers are not trainable.
|
||||
"""
|
||||
|
||||
def __init__(self, *dims: int, device: Device | str | None = None, dtype: DType | None = None) -> None:
|
||||
super().__init__()
|
||||
self.register_buffer("buffer", randn(*dims, device=device, dtype=dtype))
|
||||
|
||||
@property
|
||||
def device(self) -> Device:
|
||||
return self.buffer.device
|
||||
|
||||
@property
|
||||
def dtype(self) -> DType:
|
||||
return self.buffer.dtype
|
||||
|
||||
def forward(self, _: Tensor) -> Tensor:
|
||||
return self.buffer
|
||||
|
||||
def __repr__(self):
|
||||
dims_repr = ", ".join([repr(d) for d in list(self.buffer.shape)])
|
||||
return f"{self.__class__.__name__}({dims_repr}, device={repr(self.device)})"
|
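Most of these basics are thin shape helpers; two worth calling out are Reshape, which always keeps the batch dimension, and Parameter/Buffer, which return a stored tensor regardless of their input. Sketch:

```python
import torch
import refiners.fluxion.layers as fl

reshape = fl.Reshape(4, 8)       # per-sample target shape, batch dimension preserved
x = torch.randn(2, 32)
print(reshape(x).shape)          # torch.Size([2, 4, 8])

parameter = fl.Parameter(16)     # a trainable tensor, returned whatever the input is
print(parameter(x).shape)        # torch.Size([16])
```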
466
src/refiners/fluxion/layers/chain.py
Normal file
|
@@ -0,0 +1,466 @@
|
|||
import inspect
|
||||
from typing import Any, Callable, Iterable, Iterator, TypeVar, cast, overload
|
||||
from torch import Tensor, cat, device as Device, dtype as DType
|
||||
from refiners.fluxion.layers.basics import Identity
|
||||
from refiners.fluxion.layers.module import Module, ContextModule, WeightedModule
|
||||
from refiners.fluxion.context import Contexts, ContextProvider
|
||||
|
||||
|
||||
T = TypeVar("T", bound=Module)
|
||||
TChain = TypeVar("TChain", bound="Chain") # because Self (PEP 673) is not in 3.10
|
||||
|
||||
|
||||
class Lambda(Module):
|
||||
"""Lambda is a wrapper around a callable object that allows it to be used as a PyTorch module."""
|
||||
|
||||
def __init__(self, func: Callable[..., Any]) -> None:
|
||||
super().__init__()
|
||||
self.func = func
|
||||
|
||||
def forward(self, *args: Any) -> Any:
|
||||
return self.func(*args)
|
||||
|
||||
def __repr__(self):
|
||||
func_name = getattr(self.func, "__name__", "partial_function")
|
||||
return f"Lambda({func_name}{str(inspect.signature(self.func))})"
|
||||
|
||||
|
||||
def generate_unique_names(
|
||||
modules: tuple[Module, ...],
|
||||
) -> dict[str, Module]:
|
||||
class_counts: dict[str, int] = {}
|
||||
unique_names: list[tuple[str, Module]] = []
|
||||
for module in modules:
|
||||
class_name = module.__class__.__name__
|
||||
class_counts[class_name] = class_counts.get(class_name, 0) + 1
|
||||
name_counter: dict[str, int] = {}
|
||||
for module in modules:
|
||||
class_name = module.__class__.__name__
|
||||
name_counter[class_name] = name_counter.get(class_name, 0) + 1
|
||||
unique_name = f"{class_name}_{name_counter[class_name]}" if class_counts[class_name] > 1 else class_name
|
||||
unique_names.append((unique_name, module))
|
||||
return dict(unique_names)
|
||||
|
||||
|
||||
class UseContext(ContextModule):
|
||||
structural_attrs = ["context", "key", "func"]
|
||||
|
||||
def __init__(self, context: str, key: str) -> None:
|
||||
super().__init__()
|
||||
self.context = context
|
||||
self.key = key
|
||||
self.func: Callable[[Any], Any] = lambda x: x
|
||||
|
||||
def __call__(self, *args: Any) -> Any:
|
||||
context = self.use_context(self.context)
|
||||
assert context, f"context {self.context} is unset"
|
||||
value = context.get(self.key)
|
||||
assert value is not None, f"context entry {self.context}.{self.key} is unset"
|
||||
return self.func(value)
|
||||
|
||||
def __repr__(self):
|
||||
return f"{self.__class__.__name__}(context={repr(self.context)}, key={repr(self.key)})"
|
||||
|
||||
def compose(self, func: Callable[[Any], Any]) -> "UseContext":
|
||||
self.func = func
|
||||
return self
|
||||
|
||||
|
||||
class SetContext(ContextModule):
|
||||
"""A Module that sets a context value when executed.
|
||||
|
||||
The context needs to pre-exist in the context provider.
|
||||
# TODO: is there a way to create the context if it doesn't exist?
|
||||
"""
|
||||
|
||||
structural_attrs = ["context", "key", "callback"]
|
||||
|
||||
def __init__(self, context: str, key: str, callback: Callable[[Any, Any], Any] | None = None) -> None:
|
||||
super().__init__()
|
||||
self.context = context
|
||||
self.key = key
|
||||
self.callback = callback
|
||||
|
||||
def __call__(self, x: Tensor) -> Tensor:
|
||||
if context := self.use_context(self.context):
|
||||
if not self.callback:
|
||||
context.update({self.key: x})
|
||||
else:
|
||||
self.callback(context[self.key], x)
|
||||
|
||||
return x
|
||||
|
||||
def __repr__(self):
|
||||
return f"{self.__class__.__name__}(context={repr(self.context)}, key={repr(self.key)})"
|
||||
|
||||
|
||||
class ReturnException(Exception):
|
||||
"""Exception raised when a Return module is encountered."""
|
||||
|
||||
def __init__(self, value: Tensor):
|
||||
self.value = value
|
||||
|
||||
|
||||
class Return(Module):
|
||||
"""A Module that stops the execution of a Chain when encountered."""
|
||||
|
||||
def forward(self, x: Tensor):
|
||||
raise ReturnException(x)
|
||||
|
||||
|
||||
def structural_copy(m: T) -> T:
|
||||
return m.structural_copy() if isinstance(m, ContextModule) else m
|
||||
|
||||
|
||||
class Chain(ContextModule):
|
||||
_modules: dict[str, Module]
|
||||
_provider: ContextProvider
|
||||
|
||||
def __init__(self, *args: Module | Iterable[Module]) -> None:
|
||||
super().__init__()
|
||||
self._provider = ContextProvider()
|
||||
modules = cast(
|
||||
tuple[Module],
|
||||
(
|
||||
tuple(args[0])
|
||||
if len(args) == 1 and isinstance(args[0], Iterable) and not isinstance(args[0], Chain)
|
||||
else tuple(args)
|
||||
),
|
||||
)
|
||||
|
||||
for module in modules:
|
||||
# Violating this would mean a ContextModule ends up in two chains,
|
||||
# with a single one correctly set as its parent.
|
||||
assert (
|
||||
(not isinstance(module, ContextModule))
|
||||
or (not module._can_refresh_parent)
|
||||
or (module.parent is None)
|
||||
or (module.parent == self)
|
||||
), f"{module.__class__.__name__} already has parent {module.parent.__class__.__name__}"
|
||||
|
||||
self._regenerate_keys(modules)
|
||||
self._reset_context()
|
||||
|
||||
for module in self:
|
||||
if isinstance(module, ContextModule) and module._can_refresh_parent and module.parent != self:
|
||||
module._set_parent(self)
|
||||
|
||||
@property
|
||||
def provider(self) -> ContextProvider:
|
||||
return self._provider
|
||||
|
||||
def init_context(self) -> Contexts:
|
||||
return {}
|
||||
|
||||
def _register_provider(self, context: Contexts | None = None) -> None:
|
||||
if context:
|
||||
self._provider.update_contexts(context)
|
||||
|
||||
for module in self:
|
||||
if isinstance(module, Chain):
|
||||
module._register_provider(context=self._provider.contexts)
|
||||
|
||||
def _reset_context(self) -> None:
|
||||
self._register_provider(self.init_context())
|
||||
|
||||
def set_context(self, context: str, value: Any) -> None:
|
||||
self._provider.set_context(context, value)
|
||||
self._register_provider()
|
||||
|
||||
def debug_repr(self, layer_name: str = "") -> str:
|
||||
lines: list[str] = []
|
||||
tab = " "
|
||||
tab_length = 0
|
||||
for i, parent in enumerate(self.get_parents()[::-1]):
|
||||
lines.append(f"{tab*tab_length}{'└─ ' if i else ''}{parent.__class__.__name__}")
|
||||
tab_length += 1
|
||||
|
||||
lines.append(f"{tab*tab_length}└─ {self.__class__.__name__}")
|
||||
|
||||
for name, _ in self._modules.items():
|
||||
error_arrow = "⚠️" if name == layer_name else ""
|
||||
lines.append(f"{tab*tab_length} | {name} {error_arrow}")
|
||||
|
||||
return "\n".join(lines)
|
||||
|
||||
def call_layer(self, layer: Module, layer_name: str, *args: Any):
|
||||
try:
|
||||
return layer(*args)
|
||||
except Exception as e:
|
||||
pretty_print = self.debug_repr(layer_name)
|
||||
raise ValueError(f"Error in layer {layer_name}, args:\n {args}\n \n{pretty_print}") from e
|
||||
|
||||
def forward(self, *args: Any) -> Any:
|
||||
result: tuple[Any] | Any = None
|
||||
intermediate_args: tuple[Any, ...] = args
|
||||
for name, layer in self._modules.items():
|
||||
result = self.call_layer(layer, name, *intermediate_args)
|
||||
intermediate_args = (result,) if not isinstance(result, tuple) else result
|
||||
|
||||
self._reset_context()
|
||||
return result
|
||||
|
||||
def _regenerate_keys(self, modules: Iterable[Module]) -> None:
|
||||
self._modules = generate_unique_names(tuple(modules)) # type: ignore
|
||||
|
||||
def __add__(self, other: "Chain | Module | list[Module]") -> "Chain":
|
||||
if isinstance(other, Module):
|
||||
other = Chain(other)
|
||||
if isinstance(other, list):
|
||||
other = Chain(*other)
|
||||
return Chain(*self, *other)
|
||||
|
||||
def __getitem__(self, key: int | str | slice) -> Module:
|
||||
if isinstance(key, slice):
|
||||
return Chain(*list(self)[key])
|
||||
elif isinstance(key, str):
|
||||
return self._modules[key]
|
||||
else:
|
||||
return list(self)[key]
|
||||
|
||||
def __iter__(self) -> Iterator[Module]:
|
||||
return iter(self._modules.values())
|
||||
|
||||
def _pretty_print(self, num_tab: int = 0, layer_name: str | None = None) -> str:
|
||||
layer_name = self.__class__.__name__ if layer_name is None else layer_name
|
||||
pretty_print = f"{layer_name}:\n"
|
||||
tab = " " * (num_tab + 4)
|
||||
module_strings: list[str] = []
|
||||
for i, (name, module) in enumerate(self._modules.items()):
|
||||
ident = ("└+" if isinstance(self, Sum) else "└─") if i == 0 else " "
|
||||
module_str = (
|
||||
module
|
||||
if not isinstance(module, Chain)
|
||||
else (module._pretty_print(len(tab), name) if num_tab < 12 else f"{name}(...)")
|
||||
)
|
||||
module_strings.append(f"{tab}{ident} {module_str}")
|
||||
pretty_print += "\n".join(module_strings)
|
||||
return pretty_print
|
||||
|
||||
def __repr__(self) -> str:
|
||||
return self._pretty_print()
|
||||
|
||||
def __str__(self) -> str:
|
||||
return f"<{self.__class__.__name__} at {hex(id(self))}>"
|
||||
|
||||
def __len__(self) -> int:
|
||||
return len(self._modules)
|
||||
|
||||
@property
|
||||
def device(self) -> Device | None:
|
||||
wm = self.find(WeightedModule)
|
||||
return None if wm is None else wm.device
|
||||
|
||||
@property
|
||||
def dtype(self) -> DType | None:
|
||||
wm = self.find(WeightedModule)
|
||||
return None if wm is None else wm.dtype
|
||||
|
||||
def _walk(self, predicate: Callable[[Module, "Chain"], bool] | None = None) -> Iterator[tuple[Module, "Chain"]]:
|
||||
if predicate is None:
|
||||
predicate = lambda _m, _p: True
|
||||
for module in self:
|
||||
keep_going = True
|
||||
try:
|
||||
p = predicate(module, self)
|
||||
except StopIteration:
|
||||
p = False
|
||||
keep_going = False
|
||||
if p:
|
||||
yield (module, self)
|
||||
if keep_going and isinstance(module, Chain):
|
||||
yield from module.walk(predicate)
|
||||
|
||||
@overload
|
||||
def walk(self, predicate: Callable[[Module, "Chain"], bool] | None = None) -> Iterator[tuple[Module, "Chain"]]:
|
||||
...
|
||||
|
||||
@overload
|
||||
def walk(self, predicate: type[T]) -> Iterator[tuple[T, "Chain"]]:
|
||||
...
|
||||
|
||||
def walk(
|
||||
self, predicate: type[T] | Callable[[Module, "Chain"], bool] | None = None
|
||||
) -> Iterator[tuple[T, "Chain"]] | Iterator[tuple[Module, "Chain"]]:
|
||||
if isinstance(predicate, type):
|
||||
return self._walk(lambda m, _: isinstance(m, predicate))
|
||||
else:
|
||||
return self._walk(predicate)
|
||||
|
||||
def layers(self, layer_type: type[T]) -> Iterator[T]:
|
||||
for module, _ in self.walk(layer_type):
|
||||
yield module
|
||||
|
||||
def find(self, layer_type: type[T]) -> T | None:
|
||||
return next(self.layers(layer_type=layer_type), None)
|
||||
|
||||
def find_parent(self, module: Module) -> "Chain | None":
|
||||
if module in self: # avoid DFS-crawling the whole tree
|
||||
return self
|
||||
for _, parent in self.walk(lambda m, _: m == module):
|
||||
return parent
|
||||
return None
|
||||
|
||||
def insert(self, index: int, module: Module) -> None: # type: ignore
|
||||
if index < 0:
|
||||
index = max(0, len(self._modules) + index + 1)
|
||||
modules = list(self)
|
||||
modules.insert(index, module)
|
||||
self._regenerate_keys(modules)
|
||||
if isinstance(module, ContextModule):
|
||||
module._set_parent(self)
|
||||
self._register_provider()
|
||||
|
||||
def insert_after_type(self, module_type: type[Module], new_module: Module) -> None:
|
||||
for i, module in enumerate(self):
|
||||
if isinstance(module, module_type):
|
||||
self.insert(i + 1, new_module)
|
||||
return
|
||||
raise ValueError(f"No module of type {module_type.__name__} found in the chain.")
|
||||
|
||||
def append(self, module: Module) -> None: # type: ignore
|
||||
modules = list(self)
|
||||
modules.append(module)
|
||||
self._regenerate_keys(modules)
|
||||
if isinstance(module, ContextModule):
|
||||
module._set_parent(self)
|
||||
self._register_provider()
|
||||
|
||||
def pop(self, index: int = -1) -> Module | tuple[Module]: # type: ignore
|
||||
modules = list(self)
|
||||
if index < 0:
|
||||
index = len(modules) + index
|
||||
if index < 0 or index >= len(modules):
|
||||
raise IndexError("Index out of range.")
|
||||
removed_module = modules.pop(index)
|
||||
if isinstance(removed_module, ContextModule):
|
||||
removed_module._set_parent(None)
|
||||
self._regenerate_keys(modules)
|
||||
return removed_module
|
||||
|
||||
def remove(self, module: Module) -> None:
|
||||
"""Remove a module from the chain."""
|
||||
modules = list(self)
|
||||
try:
|
||||
modules.remove(module)
|
||||
except ValueError:
|
||||
raise ValueError(f"{module} is not in {self}")
|
||||
self._regenerate_keys(modules)
|
||||
if isinstance(module, ContextModule):
|
||||
module._set_parent(None)
|
||||
|
||||
def replace(
|
||||
self,
|
||||
old_module: Module,
|
||||
new_module: Module,
|
||||
old_module_parent: "Chain | None" = None,
|
||||
) -> None:
|
||||
"""Replace a module in the chain with a new module."""
|
||||
modules = list(self)
|
||||
try:
|
||||
modules[modules.index(old_module)] = new_module
|
||||
except ValueError:
|
||||
raise ValueError(f"{old_module} is not in {self}")
|
||||
self._regenerate_keys(modules)
|
||||
if isinstance(new_module, ContextModule):
|
||||
new_module._set_parent(self)
|
||||
if isinstance(old_module, ContextModule):
|
||||
old_module._set_parent(old_module_parent)
|
||||
|
||||
def structural_copy(self: TChain) -> TChain:
|
||||
"""Copy the structure of the Chain tree.
|
||||
|
||||
This method returns a recursive copy of the Chain tree where all inner nodes
|
||||
(instances of Chain and its subclasses) are duplicated and all leaves
|
||||
(regular Modules) are not.
|
||||
|
||||
Such copies can be adapted without disrupting the base model, but do not
|
||||
require extra GPU memory since the weights are in the leaves and hence not copied.
|
||||
|
||||
This assumes all subclasses define the class variable `structural_attrs` which
|
||||
contains a list of basic attributes set in the constructor. In complicated cases
|
||||
it may be necessary to override this method.
|
||||
"""
|
||||
if hasattr(self, "_pre_structural_copy"):
|
||||
self._pre_structural_copy()
|
||||
|
||||
modules = [structural_copy(m) for m in self]
|
||||
|
||||
# Instantiate the right subclass, but do not initialize.
|
||||
clone = object.__new__(self.__class__)
|
||||
|
||||
# Copy all basic attributes of the class declared in `structural_attrs`.
|
||||
for k in self.__class__.structural_attrs:
|
||||
setattr(clone, k, getattr(self, k))
|
||||
|
||||
# Call constructor of Chain, which among other things refreshes the context tree.
|
||||
Chain.__init__(clone, *modules)
|
||||
|
||||
for module in modules:
|
||||
if isinstance(module, ContextModule):
|
||||
module._set_parent(clone)
|
||||
|
||||
if hasattr(clone, "_post_structural_copy"):
|
||||
clone._post_structural_copy(self)
|
||||
|
||||
return clone
|
||||
|
||||
|
||||
class Parallel(Chain):
|
||||
def forward(self, *args: Any) -> tuple[Tensor, ...]:
|
||||
return tuple([self.call_layer(module, name, *args) for name, module in self._modules.items()])
|
||||
|
||||
|
||||
class Distribute(Chain):
|
||||
def forward(self, *args: Any) -> tuple[Tensor, ...]:
|
||||
assert len(args) == len(self._modules), "Number of positional arguments must match number of sub-modules."
|
||||
return tuple([self.call_layer(module, name, arg) for arg, (name, module) in zip(args, self._modules.items())])
|
||||
|
||||
|
||||
class Passthrough(Chain):
|
||||
def forward(self, *inputs: Any) -> Any:
|
||||
super().forward(*inputs)
|
||||
return inputs
|
||||
|
||||
|
||||
class Sum(Chain):
|
||||
def forward(self, *inputs: Any) -> Any:
|
||||
output = None
|
||||
for layer in self:
|
||||
layer_output: Any = layer(*inputs)
|
||||
if isinstance(layer_output, tuple):
|
||||
layer_output = sum(layer_output) # type: ignore
|
||||
output = layer_output if output is None else output + layer_output
|
||||
return output
|
||||
|
||||
|
||||
class Residual(Sum):
|
||||
def __init__(self, *modules: Module) -> None:
|
||||
super().__init__(Identity(), Chain(*modules))
|
||||
|
||||
|
||||
class Breakpoint(Module):
|
||||
def __init__(self, vscode: bool = True):
|
||||
super().__init__()
|
||||
self.vscode = vscode
|
||||
|
||||
def forward(self, *args: Any):
|
||||
if self.vscode:
|
||||
import debugpy # type: ignore
|
||||
|
||||
debugpy.breakpoint() # type: ignore
|
||||
else:
|
||||
breakpoint()
|
||||
return args[0] if len(args) == 1 else args
|
||||
|
||||
|
||||
class Concatenate(Chain):
|
||||
structural_attrs = ["dim"]
|
||||
|
||||
def __init__(self, *modules: Module, dim: int = 0) -> None:
|
||||
super().__init__(*modules)
|
||||
self.dim = dim
|
||||
|
||||
def forward(self, *args: Any) -> Tensor:
|
||||
outputs = [module(*args) for module in self]
|
||||
return cat([output for output in outputs if output is not None], dim=self.dim)
|
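Putting the Chain API together: build a model declaratively, run it, introspect it with `layers`/`walk`, and take a `structural_copy` that shares weights with the original. Sketch:

```python
import torch
import refiners.fluxion.layers as fl

mlp = fl.Chain(
    fl.Linear(in_features=8, out_features=32),
    fl.ReLU(),
    fl.Linear(in_features=32, out_features=8),
)
print(mlp(torch.randn(2, 8)).shape)                                # torch.Size([2, 8])
print([linear.out_features for linear in mlp.layers(fl.Linear)])   # [32, 8]

clone = mlp.structural_copy()       # new Chain nodes, same (shared) Linear weights
assert clone.Linear_1 is mlp.Linear_1
```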
73
src/refiners/fluxion/layers/conv.py
Normal file
|
@@ -0,0 +1,73 @@
|
|||
from torch.nn import Conv2d as _Conv2d, Conv1d as _Conv1d
|
||||
from torch import device as Device, dtype as DType
|
||||
from refiners.fluxion.layers.module import WeightedModule
|
||||
|
||||
|
||||
class Conv2d(_Conv2d, WeightedModule):
|
||||
def __init__(
|
||||
self,
|
||||
in_channels: int,
|
||||
out_channels: int,
|
||||
kernel_size: int | tuple[int, ...],
|
||||
stride: int | tuple[int, ...] = 1,
|
||||
padding: int | tuple[int, ...] | str = 0,
|
||||
dilation: int | tuple[int, ...] = 1,
|
||||
groups: int = 1,
|
||||
use_bias: bool = True,
|
||||
padding_mode: str = "zeros",
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
) -> None:
|
||||
super().__init__( # type: ignore
|
||||
in_channels,
|
||||
out_channels,
|
||||
kernel_size,
|
||||
stride,
|
||||
padding,
|
||||
dilation,
|
||||
groups,
|
||||
use_bias,
|
||||
padding_mode,
|
||||
device,
|
||||
dtype,
|
||||
)
|
||||
self.in_channels = in_channels
|
||||
self.out_channels = out_channels
|
||||
self.padding = (padding,) if isinstance(padding, int) else padding
|
||||
self.dilation = (dilation,) if isinstance(dilation, int) else dilation
|
||||
self.groups = groups
|
||||
self.use_bias = use_bias
|
||||
self.padding_mode = padding_mode
|
||||
|
||||
|
||||
class Conv1d(_Conv1d, WeightedModule):
|
||||
def __init__(
|
||||
self,
|
||||
in_channels: int,
|
||||
out_channels: int,
|
||||
kernel_size: int | tuple[int, ...],
|
||||
stride: int | tuple[int, ...] = 1,
|
||||
padding: int | tuple[int, ...] | str = 0,
|
||||
dilation: int | tuple[int, ...] = 1,
|
||||
groups: int = 1,
|
||||
use_bias: bool = True,
|
||||
padding_mode: str = "zeros",
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
) -> None:
|
||||
super().__init__( # type: ignore
|
||||
in_channels,
|
||||
out_channels,
|
||||
kernel_size,
|
||||
stride,
|
||||
padding,
|
||||
dilation,
|
||||
groups,
|
||||
use_bias,
|
||||
padding_mode,
|
||||
device,
|
||||
dtype,
|
||||
)
|
||||
self.in_channels = in_channels
|
||||
self.out_channels = out_channels
|
||||
self.use_bias = use_bias
|
21
src/refiners/fluxion/layers/embedding.py
Normal file
|
@@ -0,0 +1,21 @@
|
|||
from refiners.fluxion.layers.module import WeightedModule
|
||||
from torch.nn import Embedding as _Embedding
|
||||
from torch import Tensor, device as Device, dtype as DType
|
||||
|
||||
from jaxtyping import Float, Int
|
||||
|
||||
|
||||
class Embedding(_Embedding, WeightedModule): # type: ignore
|
||||
def __init__(
|
||||
self,
|
||||
num_embeddings: int,
|
||||
embedding_dim: int,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
):
|
||||
_Embedding.__init__( # type: ignore
|
||||
self, num_embeddings=num_embeddings, embedding_dim=embedding_dim, device=device, dtype=dtype
|
||||
)
|
||||
|
||||
def forward(self, x: Int[Tensor, "batch length"]) -> Float[Tensor, "batch length embedding_dim"]: # type: ignore
|
||||
return super().forward(x)
|
50
src/refiners/fluxion/layers/linear.py
Normal file
|
@@ -0,0 +1,50 @@
|
|||
from torch import device as Device, dtype as DType
|
||||
from torch.nn import Linear as _Linear
|
||||
from torch import Tensor
|
||||
from refiners.fluxion.layers.module import Module, WeightedModule
|
||||
from refiners.fluxion.layers.activations import ReLU
|
||||
from refiners.fluxion.layers.chain import Chain
|
||||
|
||||
from jaxtyping import Float
|
||||
|
||||
|
||||
class Linear(_Linear, WeightedModule):
|
||||
def __init__(
|
||||
self,
|
||||
in_features: int,
|
||||
out_features: int,
|
||||
bias: bool = True,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
) -> None:
|
||||
self.in_features = in_features
|
||||
self.out_features = out_features
|
||||
super().__init__( # type: ignore
|
||||
in_features=in_features,
|
||||
out_features=out_features,
|
||||
bias=bias,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
)
|
||||
|
||||
def forward(self, x: Float[Tensor, "batch in_features"]) -> Float[Tensor, "batch out_features"]: # type: ignore
|
||||
return super().forward(x)
|
||||
|
||||
|
||||
class MultiLinear(Chain):
|
||||
def __init__(
|
||||
self,
|
||||
input_dim: int,
|
||||
output_dim: int,
|
||||
inner_dim: int,
|
||||
num_layers: int,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
) -> None:
|
||||
layers: list[Module] = []
|
||||
for i in range(num_layers - 1):
|
||||
layers.append(Linear(input_dim if i == 0 else inner_dim, inner_dim, device=device, dtype=dtype))
|
||||
layers.append(ReLU())
|
||||
layers.append(Linear(inner_dim, output_dim, device=device, dtype=dtype))
|
||||
|
||||
super().__init__(layers)
|
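MultiLinear stacks `num_layers` Linears with ReLU in between (no activation after the last one). Sketch:

```python
import torch
import refiners.fluxion.layers as fl

head = fl.MultiLinear(input_dim=4, output_dim=2, inner_dim=16, num_layers=3)
print(head(torch.randn(5, 4)).shape)  # torch.Size([5, 2])
```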
100
src/refiners/fluxion/layers/module.py
Normal file
|
@@ -0,0 +1,100 @@
|
|||
from pathlib import Path
|
||||
from typing import Any, Generator, TypeVar
|
||||
|
||||
from torch import device as Device, dtype as DType
|
||||
from torch.nn.modules.module import Module as TorchModule
|
||||
|
||||
from refiners.fluxion.utils import load_from_safetensors
|
||||
from refiners.fluxion.context import Context, ContextProvider
|
||||
|
||||
from typing import Callable, TYPE_CHECKING
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from refiners.fluxion.layers.chain import Chain
|
||||
|
||||
T = TypeVar("T", bound="Module")
|
||||
TContextModule = TypeVar("TContextModule", bound="ContextModule")
|
||||
|
||||
|
||||
class Module(TorchModule):
|
||||
_parameters: dict[str, Any]
|
||||
_buffers: dict[str, Any]
|
||||
|
||||
__getattr__: Callable[["Module", str], Any] # type: ignore
|
||||
__setattr__: Callable[["Module", str, Any], None] # type: ignore
|
||||
|
||||
def __init__(self, *args: Any, **kwargs: Any) -> None:
|
||||
super().__init__(*args, **kwargs)  # type: ignore
|
||||
|
||||
def load_from_safetensors(self, tensors_path: str | Path, strict: bool = True) -> "Module":
|
||||
state_dict = load_from_safetensors(tensors_path)
|
||||
self.load_state_dict(state_dict, strict=strict)
|
||||
return self
|
||||
|
||||
def named_modules(self, *args: Any, **kwargs: Any) -> "Generator[tuple[str, Module], None, None]": # type: ignore
|
||||
return super().named_modules(*args) # type: ignore
|
||||
|
||||
def to(self: T, device: Device | str | None = None, dtype: DType | None = None) -> T: # type: ignore
|
||||
return super().to(device=device, dtype=dtype) # type: ignore
|
||||
|
||||
|
||||
class ContextModule(Module):
|
||||
# we store parent in a one-element list to avoid PyTorch thinking it's a submodule
|
||||
_parent: "list[Chain]"
|
||||
_can_refresh_parent: bool = True # see usage in Adapter and Chain
|
||||
|
||||
# Contains simple attributes set on the instance by `__init__` in subclasses
|
||||
# and copied by `structural_copy`. Note that this is not the case for `device`, since
|
||||
# Chain's __init__ takes care of it.
|
||||
structural_attrs: list[str] = []
|
||||
|
||||
def __init__(self, *args: Any, **kwargs: Any) -> None:
|
||||
super().__init__(*args, **kwargs)
|
||||
self._parent = []
|
||||
|
||||
@property
|
||||
def parent(self) -> "Chain | None":
|
||||
return self._parent[0] if self._parent else None
|
||||
|
||||
@property
|
||||
def ensure_parent(self) -> "Chain":
|
||||
assert self._parent, "module is not bound to a Chain"
|
||||
return self._parent[0]
|
||||
|
||||
def _set_parent(self, parent: "Chain | None") -> None:
|
||||
if parent is None:
|
||||
self._parent = []
|
||||
return
|
||||
# Always insert the module in the Chain first to avoid inconsistencies.
|
||||
assert self in iter(parent), f"{self} not in {parent}"
|
||||
self._parent = [parent]
|
||||
|
||||
@property
|
||||
def provider(self) -> ContextProvider:
|
||||
return self.ensure_parent.provider
|
||||
|
||||
def get_parents(self) -> "list[Chain]":
|
||||
return self._parent + self._parent[0].get_parents() if self._parent else []
|
||||
|
||||
def use_context(self, context_name: str) -> Context:
|
||||
"""Retrieve the context object from the module's context provider."""
|
||||
context = self.provider.get_context(context_name)
|
||||
assert context is not None, f"Context {context_name} not found."
|
||||
return context
|
||||
|
||||
def structural_copy(self: TContextModule) -> TContextModule:
|
||||
clone = object.__new__(self.__class__)
|
||||
for k in self.__class__.structural_attrs:
|
||||
setattr(clone, k, getattr(self, k))
|
||||
ContextModule.__init__(clone)
|
||||
return clone
|
||||
|
||||
|
||||
class WeightedModule(Module):
|
||||
@property
|
||||
def device(self) -> Device:
|
||||
return self.weight.device
|
||||
|
||||
@property
|
||||
def dtype(self) -> DType:
|
||||
return self.weight.dtype
|
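A ContextModule does not own a ContextProvider; it reaches the one owned by its parent Chain, which is how UseContext/SetContext (defined in chain.py above) read and write shared state. Sketch:

```python
import refiners.fluxion.layers as fl

chain = fl.Chain(fl.UseContext(context="diffusion", key="timestep"))
chain.set_context("diffusion", {"timestep": 42})

assert chain.UseContext.parent is chain   # bound to its parent when the Chain was built
print(chain(None))                        # 42, read through the parent's ContextProvider
```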
75
src/refiners/fluxion/layers/norm.py
Normal file
|
@@ -0,0 +1,75 @@
|
|||
from torch import ones, zeros, Tensor, sqrt, device as Device, dtype as DType
|
||||
from torch.nn import GroupNorm as _GroupNorm, Parameter, LayerNorm as _LayerNorm
|
||||
from jaxtyping import Float
|
||||
from refiners.fluxion.layers.module import WeightedModule
|
||||
|
||||
|
||||
class LayerNorm(_LayerNorm, WeightedModule):
|
||||
def __init__(
|
||||
self,
|
||||
normalized_shape: int | list[int],
|
||||
eps: float = 0.00001,
|
||||
elementwise_affine: bool = True,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
) -> None:
|
||||
super().__init__( # type: ignore
|
||||
normalized_shape=normalized_shape,
|
||||
eps=eps,
|
||||
elementwise_affine=elementwise_affine,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
)
|
||||
|
||||
|
||||
class GroupNorm(_GroupNorm, WeightedModule):
|
||||
def __init__(
|
||||
self,
|
||||
channels: int,
|
||||
num_groups: int,
|
||||
eps: float = 1e-5,
|
||||
affine: bool = True,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
) -> None:
|
||||
super().__init__( # type: ignore
|
||||
num_groups=num_groups,
|
||||
num_channels=channels,
|
||||
eps=eps,
|
||||
affine=affine,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
)
|
||||
self.channels = channels
|
||||
self.num_groups = num_groups
|
||||
self.eps = eps
|
||||
self.affine = affine
|
||||
|
||||
|
||||
class LayerNorm2d(WeightedModule):
|
||||
"""
|
||||
2D Layer Normalization module.
|
||||
|
||||
Parameters:
|
||||
channels (int): Number of channels in the input tensor.
|
||||
eps (float, optional): A small constant for numerical stability. Default: 1e-6.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
channels: int,
|
||||
eps: float = 1e-6,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
) -> None:
|
||||
super().__init__()
|
||||
self.weight = Parameter(ones(channels, device=device, dtype=dtype))
|
||||
self.bias = Parameter(zeros(channels, device=device, dtype=dtype))
|
||||
self.eps = eps
|
||||
|
||||
def forward(self, x: Float[Tensor, "batch channels height width"]) -> Float[Tensor, "batch channels height width"]:
|
||||
x_mean = x.mean(1, keepdim=True)
|
||||
x_var = (x - x_mean).pow(2).mean(1, keepdim=True)
|
||||
x_norm = (x - x_mean) / sqrt(x_var + self.eps)
|
||||
x_out = self.weight.unsqueeze(-1).unsqueeze(-1) * x_norm + self.bias.unsqueeze(-1).unsqueeze(-1)
|
||||
return x_out
|
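LayerNorm2d normalizes each spatial position across the channel dimension, with learnable per-channel scale and shift. Sketch:

```python
import torch
from refiners.fluxion.layers import LayerNorm2d

norm = LayerNorm2d(channels=8)
x = torch.randn(2, 8, 16, 16)
y = norm(x)
assert y.shape == x.shape
# with weight=1 and bias=0 at init, each pixel has ~zero mean across channels
assert torch.allclose(y.mean(dim=1), torch.zeros(2, 16, 16), atol=1e-5)
```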
100
src/refiners/fluxion/layers/sampling.py
Normal file
|
@@ -0,0 +1,100 @@
|
|||
from refiners.fluxion.layers.chain import Chain, UseContext, SetContext
|
||||
from refiners.fluxion.layers.conv import Conv2d
|
||||
from refiners.fluxion.layers.basics import Identity
|
||||
from refiners.fluxion.layers.chain import Parallel, Lambda
|
||||
from refiners.fluxion.layers.module import Module
|
||||
from refiners.fluxion.utils import interpolate
|
||||
from torch.nn.functional import pad
|
||||
from torch import Tensor, Size, device as Device, dtype as DType
|
||||
|
||||
|
||||
class Downsample(Chain):
|
||||
structural_attrs = ["channels", "in_channels", "out_channels", "scale_factor", "padding"]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
channels: int,
|
||||
scale_factor: int,
|
||||
padding: int = 0,
|
||||
register_shape: bool = True,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
):
|
||||
"""Downsamples the input by the given scale factor.
|
||||
|
||||
If register_shape is True, the input shape is registered in the `sampling` context under the `shapes` key. This
raises an error if the `sampling` context is not set or does not contain a list.
|
||||
"""
|
||||
self.channels = channels
|
||||
self.in_channels = channels
|
||||
self.out_channels = channels
|
||||
self.scale_factor = scale_factor
|
||||
self.padding = padding
|
||||
super().__init__(
|
||||
Conv2d(
|
||||
in_channels=channels,
|
||||
out_channels=channels,
|
||||
kernel_size=3,
|
||||
stride=scale_factor,
|
||||
padding=padding,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
)
|
||||
if padding == 0:
|
||||
self.insert(0, Lambda(lambda x: pad(x, (0, 1, 0, 1))))
|
||||
if register_shape:
|
||||
self.insert(0, SetContext(context="sampling", key="shapes", callback=self.register_shape))
|
||||
|
||||
def register_shape(self, shapes: list[Size], x: Tensor) -> None:
|
||||
shapes.append(x.shape[2:])
|
||||
|
||||
|
||||
class Interpolate(Module):
|
||||
def __init__(self) -> None:
|
||||
super().__init__()
|
||||
|
||||
def forward(self, x: Tensor, shape: Size) -> Tensor:
|
||||
return interpolate(x, shape)
|
||||
|
||||
|
||||
class Upsample(Chain):
|
||||
structural_attrs = ["channels", "upsample_factor"]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
channels: int,
|
||||
upsample_factor: int | None = None,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
):
|
||||
"""Upsamples the input by the given scale factor.
|
||||
|
||||
If upsample_factor is None, the target shape is taken from the `sampling` context. This raises an error if the
`sampling` context is not set or is empty (i.e. no matching Downsample registered its input shape with register_shape=True).
|
||||
"""
|
||||
self.channels = channels
|
||||
self.upsample_factor = upsample_factor
|
||||
super().__init__(
|
||||
Parallel(
|
||||
Identity(),
|
||||
(
|
||||
Lambda(self._get_static_shape)
|
||||
if upsample_factor is not None
|
||||
else UseContext(context="sampling", key="shapes").compose(lambda x: x.pop())
|
||||
),
|
||||
),
|
||||
Interpolate(),
|
||||
Conv2d(
|
||||
in_channels=channels,
|
||||
out_channels=channels,
|
||||
kernel_size=3,
|
||||
padding=1,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
)
|
||||
|
||||
def _get_static_shape(self, x: Tensor) -> Size:
|
||||
assert self.upsample_factor is not None
|
||||
return Size([size * self.upsample_factor for size in x.shape[2:]])
|
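The padding=0 branch of Downsample relies on an asymmetric (0, 1, 0, 1) pad before the stride-2 convolution, which halves even-sized inputs exactly as the Stable Diffusion autoencoder does. A self-contained sketch of that shape arithmetic in plain PyTorch (the channel count 128 is arbitrary):

import torch
from torch.nn.functional import pad

conv = torch.nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=2, padding=0)
x = torch.randn(1, 128, 64, 64)
# pad right and bottom by one pixel, then convolve with stride 2: 64 -> 32
y = conv(pad(x, (0, 1, 0, 1)))
assert y.shape == (1, 128, 32, 32)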
262
src/refiners/fluxion/utils.py
Normal file
|
@ -0,0 +1,262 @@
|
|||
from collections import defaultdict
|
||||
from typing import TYPE_CHECKING, Any, Callable, Dict, Iterable, Literal, TypeVar
|
||||
from PIL import Image
|
||||
from numpy import array, float32
|
||||
from pathlib import Path
|
||||
from safetensors import safe_open as _safe_open # type: ignore
|
||||
from safetensors.torch import save_file as _save_file # type: ignore
|
||||
from torch import norm as _norm, manual_seed as _manual_seed # type: ignore
|
||||
from torch.nn.functional import pad as _pad, interpolate as _interpolate # type: ignore
|
||||
from torch import Size, Tensor, tensor, no_grad, device as Device, dtype as DType
|
||||
from torch.utils.hooks import RemovableHandle
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from refiners.fluxion.layers.module import Module
|
||||
|
||||
|
||||
T = TypeVar("T")
|
||||
E = TypeVar("E")
|
||||
|
||||
|
||||
def norm(x: Tensor) -> Tensor:
|
||||
return _norm(x) # type: ignore
|
||||
|
||||
|
||||
def manual_seed(seed: int) -> None:
|
||||
_manual_seed(seed)
|
||||
|
||||
|
||||
def pad(x: Tensor, pad: Iterable[int], value: float = 0.0) -> Tensor:
|
||||
return _pad(input=x, pad=pad, value=value) # type: ignore
|
||||
|
||||
|
||||
def interpolate(x: Tensor, factor: float | Size, mode: str = "nearest") -> Tensor:
|
||||
return (
|
||||
_interpolate(x, scale_factor=factor, mode=mode)
|
||||
if isinstance(factor, float | int)
|
||||
else _interpolate(x, size=factor, mode=mode)
|
||||
) # type: ignore
|
||||
|
||||
|
||||
def bidirectional_mapping(mapping: Dict[str, str]) -> Dict[str, str]:
|
||||
return {**mapping, **{value: key for key, value in mapping.items()}}
|
||||
|
||||
|
||||
def image_to_tensor(image: Image.Image, device: Device | str | None = None, dtype: DType | None = None) -> Tensor:
|
||||
return tensor(array(image).astype(float32).transpose(2, 0, 1) / 255.0, device=device, dtype=dtype).unsqueeze(0)
|
||||
|
||||
|
||||
def tensor_to_image(tensor: Tensor) -> Image.Image:
|
||||
return Image.fromarray((tensor.clamp(0, 1).squeeze(0).permute(1, 2, 0).cpu().numpy() * 255).astype("uint8")) # type: ignore
|
||||
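Usage sketch for the two image helpers above (assuming the package is importable): image_to_tensor produces a (1, 3, H, W) float tensor with values in [0, 1], and tensor_to_image converts such a tensor back to a PIL image of the original size.

from PIL import Image
from refiners.fluxion.utils import image_to_tensor, tensor_to_image

image = Image.new("RGB", (64, 32), color=(255, 0, 0))
t = image_to_tensor(image)
assert t.shape == (1, 3, 32, 64)  # batch, channels, height, width
round_trip = tensor_to_image(t)
assert round_trip.size == (64, 32)  # PIL size is (width, height)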
|
||||
|
||||
def safe_open(
|
||||
path: Path | str,
|
||||
framework: Literal["pytorch", "tensorflow", "flax", "numpy"],
|
||||
device: Device | str = "cpu",
|
||||
) -> dict[str, Tensor]:
|
||||
framework_mapping = {
|
||||
"pytorch": "pt",
|
||||
"tensorflow": "tf",
|
||||
"flax": "flax",
|
||||
"numpy": "numpy",
|
||||
}
|
||||
return _safe_open(str(path), framework=framework_mapping[framework], device=str(device)) # type: ignore
|
||||
|
||||
|
||||
def load_from_safetensors(path: Path | str, device: Device | str = "cpu") -> dict[str, Tensor]:
|
||||
with safe_open(path=path, framework="pytorch", device=device) as tensors: # type: ignore
|
||||
return {key: tensors.get_tensor(key) for key in tensors.keys()} # type: ignore
|
||||
|
||||
|
||||
def load_metadata_from_safetensors(path: Path | str) -> dict[str, str] | None:
|
||||
with safe_open(path=path, framework="pytorch") as tensors: # type: ignore
|
||||
return tensors.metadata() # type: ignore
|
||||
|
||||
|
||||
def save_to_safetensors(path: Path | str, tensors: dict[str, Tensor], metadata: dict[str, str] | None = None) -> None:
|
||||
_save_file(tensors, path, metadata) # type: ignore
|
||||
|
||||
|
||||
BASIC_LAYERS: list[str] = [
|
||||
"Conv1d",
|
||||
"Conv2d",
|
||||
"Conv3d",
|
||||
"Linear",
|
||||
"BatchNorm1d",
|
||||
"BatchNorm2d",
|
||||
"BatchNorm3d",
|
||||
"LayerNorm",
|
||||
"GroupNorm",
|
||||
"Embedding",
|
||||
"MaxPool2d",
|
||||
"AvgPool2d",
|
||||
"AdaptiveAvgPool2d",
|
||||
]
|
||||
|
||||
ModelTypeShape = tuple[str, tuple[Size, ...]]
|
||||
|
||||
|
||||
def is_basic_layer(module: "Module") -> bool:
|
||||
return module.__class__.__name__ in BASIC_LAYERS
|
||||
|
||||
|
||||
def get_module_signature(module: "Module") -> ModelTypeShape:
|
||||
param_shapes = [p.shape for p in module.parameters()]
|
||||
return (module.__class__.__name__, tuple(param_shapes))
|
||||
|
||||
|
||||
def forward_order_of_execution(
|
||||
module: "Module",
|
||||
example_args: tuple[Any, ...],
|
||||
key_skipper: Callable[[str], bool] | None = None,
|
||||
) -> dict[ModelTypeShape, list[str]]:
|
||||
key_skipper = key_skipper or (lambda _: False)
|
||||
|
||||
submodule_to_key: dict["Module", str] = {}
|
||||
execution_order: defaultdict[ModelTypeShape, list[str]] = defaultdict(list)
|
||||
|
||||
def collect_execution_order_hook(layer: "Module", *_: Any):
|
||||
layer_signature = get_module_signature(layer)
|
||||
execution_order[layer_signature].append(submodule_to_key[layer])
|
||||
|
||||
hooks: list[RemovableHandle] = []
|
||||
for name, submodule in module.named_modules():
|
||||
if is_basic_layer(submodule) and not key_skipper(name):
|
||||
submodule_to_key[submodule] = name
|
||||
hook = submodule.register_forward_hook(collect_execution_order_hook)
|
||||
hooks.append(hook)
|
||||
|
||||
with no_grad():
|
||||
module(*example_args)
|
||||
|
||||
for hook in hooks:
|
||||
hook.remove()
|
||||
|
||||
return dict(execution_order)
|
||||
|
||||
|
||||
def print_side_by_side(
|
||||
shape: ModelTypeShape,
|
||||
source_keys: list[str],
|
||||
target_keys: list[str],
|
||||
):
|
||||
print(f"{shape}")
|
||||
max_len = max(len(source_keys), len(target_keys))
|
||||
for i in range(max_len):
|
||||
source_key = source_keys[i] if i < len(source_keys) else "---"
|
||||
target_key = target_keys[i] if i < len(target_keys) else "---"
|
||||
print(f"\t{source_key}\t{target_key}")
|
||||
|
||||
|
||||
def verify_shape_match(
|
||||
source_order: dict[ModelTypeShape, list[str]], target_order: dict[ModelTypeShape, list[str]]
|
||||
) -> bool:
|
||||
model_type_shapes = set(source_order.keys()) | set(target_order.keys())
|
||||
shape_mismatched = False
|
||||
|
||||
for model_type_shape in model_type_shapes:
|
||||
source_keys = source_order.get(model_type_shape, [])
|
||||
target_keys = target_order.get(model_type_shape, [])
|
||||
|
||||
if len(source_keys) != len(target_keys):
|
||||
shape_mismatched = True
|
||||
print_side_by_side(model_type_shape, source_keys, target_keys)
|
||||
|
||||
return not shape_mismatched
|
||||
|
||||
|
||||
def create_state_dict_mapping(
|
||||
source_model: "Module",
|
||||
target_model: "Module",
|
||||
source_args: tuple[Any, ...],
|
||||
target_args: tuple[Any, ...] | None = None,
|
||||
source_key_skipper: Callable[[str], bool] | None = None,
|
||||
target_key_skipper: Callable[[str], bool] | None = None,
|
||||
) -> dict[str, str] | None:
|
||||
if target_args is None:
|
||||
target_args = source_args
|
||||
|
||||
source_order = forward_order_of_execution(source_model, source_args, source_key_skipper)
|
||||
target_order = forward_order_of_execution(target_model, target_args, target_key_skipper)
|
||||
|
||||
if not verify_shape_match(source_order, target_order):
|
||||
return None
|
||||
|
||||
mapping: dict[str, str] = {}
|
||||
for model_type_shape in source_order:
|
||||
source_keys = source_order[model_type_shape]
|
||||
target_keys = target_order[model_type_shape]
|
||||
mapping.update(zip(target_keys, source_keys))
|
||||
|
||||
return mapping
|
||||
|
||||
|
||||
def convert_state_dict(
|
||||
source_state_dict: dict[str, Tensor], target_state_dict: dict[str, Tensor], state_dict_mapping: dict[str, str]
|
||||
) -> dict[str, Tensor]:
|
||||
converted_state_dict: dict[str, Tensor] = {}
|
||||
for target_key in target_state_dict:
|
||||
target_prefix, suffix = target_key.rsplit(".", 1)
|
||||
source_prefix = state_dict_mapping[target_prefix]
|
||||
source_key = ".".join([source_prefix, suffix])
|
||||
converted_state_dict[target_key] = source_state_dict[source_key]
|
||||
|
||||
return converted_state_dict
|
||||
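A minimal end-to-end sketch of the conversion helpers above, using two hypothetical toy models that have the same layers in the same execution order but different attribute names. The helpers are annotated against refiners' Module, but they only rely on named_modules, parameters and forward hooks, so plain torch.nn modules are enough for illustration.

import torch
from torch import nn
from refiners.fluxion.utils import create_state_dict_mapping, convert_state_dict

class Source(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.proj_in = nn.Linear(4, 8)
        self.proj_out = nn.Linear(8, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj_out(self.proj_in(x))

class Target(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.first = nn.Linear(4, 8)
        self.second = nn.Linear(8, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.second(self.first(x))

source, target = Source(), Target()
x = torch.randn(1, 4)
# match layers by (class name, parameter shapes) and execution order
mapping = create_state_dict_mapping(source, target, source_args=(x,))  # type: ignore
assert mapping == {"first": "proj_in", "second": "proj_out"}
state_dict = convert_state_dict(source.state_dict(), target.state_dict(), mapping)
target.load_state_dict(state_dict)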
|
||||
|
||||
def forward_store_outputs(
|
||||
module: "Module",
|
||||
example_args: tuple[Any, ...],
|
||||
key_skipper: Callable[[str], bool] | None = None,
|
||||
) -> list[tuple[str, Tensor]]:
|
||||
key_skipper = key_skipper or (lambda _: False)
|
||||
submodule_to_key: dict["Module", str] = {}
|
||||
execution_order: list[tuple[str, Tensor]] = [] # Store outputs in a list
|
||||
|
||||
def collect_execution_order_hook(layer: "Module", _: Any, output: Tensor):
|
||||
execution_order.append((submodule_to_key[layer], output.clone())) # Store a copy of the output
|
||||
|
||||
hooks: list[RemovableHandle] = []
|
||||
for name, submodule in module.named_modules():
|
||||
if is_basic_layer(submodule) and not key_skipper(name):
|
||||
submodule_to_key[submodule] = name
|
||||
hook = submodule.register_forward_hook(collect_execution_order_hook)
|
||||
hooks.append(hook)
|
||||
|
||||
with no_grad():
|
||||
module(*example_args)
|
||||
|
||||
for hook in hooks:
|
||||
hook.remove()
|
||||
|
||||
return execution_order
|
||||
|
||||
|
||||
def compare_models(
|
||||
source_model: "Module",
|
||||
target_model: "Module",
|
||||
source_args: tuple[Any, ...],
|
||||
target_args: tuple[Any, ...] | None = None,
|
||||
source_key_skipper: Callable[[str], bool] | None = None,
|
||||
target_key_skipper: Callable[[str], bool] | None = None,
|
||||
threshold: float = 1e-5,
|
||||
) -> bool:
|
||||
if target_args is None:
|
||||
target_args = source_args
|
||||
|
||||
source_order = forward_store_outputs(source_model, source_args, source_key_skipper)
|
||||
target_order = forward_store_outputs(target_model, target_args, target_key_skipper)
|
||||
|
||||
prev_source_key, prev_target_key = None, None
|
||||
for (source_key, source_output), (target_key, target_output) in zip(source_order, target_order):
|
||||
diff = norm(source_output - target_output).item()
|
||||
if diff > threshold:
|
||||
print(
|
||||
f"Models diverged between {prev_source_key} and {source_key}, and between {prev_target_key} and"
|
||||
f" {target_key}, difference in norm: {diff}"
|
||||
)
|
||||
return False
|
||||
prev_source_key, prev_target_key = source_key, target_key
|
||||
|
||||
return True
|
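compare_models runs both models on the same inputs and checks the outputs of every basic layer pairwise. As a degenerate but illustrative sketch, comparing a toy model against a deep copy of itself should always succeed:

import copy
import torch
from torch import nn
from refiners.fluxion.utils import compare_models

model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))
x = torch.randn(1, 4)
assert compare_models(model, copy.deepcopy(model), source_args=(x,))  # type: ignore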
0
src/refiners/foundationals/__init__.py
Normal file
0
src/refiners/foundationals/clip/__init__.py
Normal file
BIN
src/refiners/foundationals/clip/bpe_simple_vocab_16e6.txt.gz
Normal file
0
src/refiners/foundationals/clip/image_encoder.py
Normal file
250
src/refiners/foundationals/clip/text_encoder.py
Normal file
|
@ -0,0 +1,250 @@
|
|||
from torch import Tensor, arange, device as Device, dtype as DType
|
||||
|
||||
from refiners.fluxion.layers import (
|
||||
ApproximateGeLU,
|
||||
GeLU,
|
||||
Linear,
|
||||
LayerNorm,
|
||||
Embedding,
|
||||
Chain,
|
||||
Sum,
|
||||
SelfAttention,
|
||||
Lambda,
|
||||
Residual,
|
||||
)
|
||||
from refiners.foundationals.clip.tokenizer import CLIPTokenizer
|
||||
|
||||
|
||||
class PositionalTokenEncoder(Sum):
|
||||
structural_attrs = ["vocabulary_size", "positional_embedding_dim"]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
vocabulary_size: int,
|
||||
embedding_dim: int,
|
||||
positional_embedding_dim: int,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
):
|
||||
self.vocabulary_size = vocabulary_size
|
||||
self.positional_embedding_dim = positional_embedding_dim
|
||||
super().__init__(
|
||||
Embedding(
|
||||
num_embeddings=vocabulary_size,
|
||||
embedding_dim=embedding_dim,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
Chain(
|
||||
Lambda(self.get_position_ids),
|
||||
Embedding(
|
||||
num_embeddings=positional_embedding_dim,
|
||||
embedding_dim=embedding_dim,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
),
|
||||
)
|
||||
|
||||
@property
|
||||
def position_ids(self) -> Tensor:
|
||||
return arange(self.positional_embedding_dim, device=self.device).reshape(1, -1)
|
||||
|
||||
def get_position_ids(self, x: Tensor) -> Tensor:
|
||||
return self.position_ids[:, : x.shape[1]]
|
||||
|
||||
|
||||
class FeedForward(Chain):
|
||||
structural_attrs = ["embedding_dim", "feedforward_dim"]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
embedding_dim: int,
|
||||
feedforward_dim: int,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
) -> None:
|
||||
self.embedding_dim = embedding_dim
|
||||
self.feedforward_dim = feedforward_dim
|
||||
super().__init__(
|
||||
Linear(in_features=embedding_dim, out_features=feedforward_dim, device=device, dtype=dtype),
|
||||
GeLU(),
|
||||
Linear(in_features=feedforward_dim, out_features=embedding_dim, device=device, dtype=dtype),
|
||||
)
|
||||
|
||||
|
||||
class TransformerLayer(Chain):
|
||||
structural_attrs = ["embedding_dim", "num_attention_heads", "feedforward_dim", "layer_norm_eps"]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
embedding_dim: int,
|
||||
feedforward_dim: int,
|
||||
num_attention_heads: int = 1,
|
||||
layer_norm_eps: float = 1e-5,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
) -> None:
|
||||
self.embedding_dim = embedding_dim
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.feedforward_dim = feedforward_dim
|
||||
self.layer_norm_eps = layer_norm_eps
|
||||
super().__init__(
|
||||
Residual(
|
||||
LayerNorm(
|
||||
normalized_shape=embedding_dim,
|
||||
eps=layer_norm_eps,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
SelfAttention(
|
||||
embedding_dim=embedding_dim,
|
||||
num_heads=num_attention_heads,
|
||||
is_causal=True,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
),
|
||||
Residual(
|
||||
LayerNorm(
|
||||
normalized_shape=embedding_dim,
|
||||
eps=layer_norm_eps,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
FeedForward(
|
||||
embedding_dim=embedding_dim,
|
||||
feedforward_dim=feedforward_dim,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
),
|
||||
)
|
||||
|
||||
|
||||
class CLIPTextEncoder(Chain):
|
||||
structural_attrs = [
|
||||
"embedding_dim",
|
||||
"positional_embedding_dim",
|
||||
"vocabulary_size",
|
||||
"num_layers",
|
||||
"num_attention_heads",
|
||||
"feedforward_dim",
|
||||
"layer_norm_eps",
|
||||
"tokenizer",
|
||||
]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
embedding_dim: int = 768,
|
||||
positional_embedding_dim: int = 77,
|
||||
vocabulary_size: int = 49408,
|
||||
num_layers: int = 12,
|
||||
num_attention_heads: int = 12,
|
||||
feedforward_dim: int = 3072,
|
||||
layer_norm_eps: float = 1e-5,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
):
|
||||
self.embedding_dim = embedding_dim
|
||||
self.positional_embedding_dim = positional_embedding_dim
|
||||
self.vocabulary_size = vocabulary_size
|
||||
self.num_layers = num_layers
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.feedforward_dim = feedforward_dim
|
||||
self.layer_norm_eps = layer_norm_eps
|
||||
self.tokenizer = CLIPTokenizer()
|
||||
super().__init__(
|
||||
PositionalTokenEncoder(
|
||||
vocabulary_size=vocabulary_size,
|
||||
embedding_dim=embedding_dim,
|
||||
positional_embedding_dim=positional_embedding_dim,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
*(
|
||||
TransformerLayer(
|
||||
embedding_dim=embedding_dim,
|
||||
num_attention_heads=num_attention_heads,
|
||||
feedforward_dim=feedforward_dim,
|
||||
layer_norm_eps=layer_norm_eps,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
)
|
||||
for _ in range(num_layers)
|
||||
),
|
||||
LayerNorm(normalized_shape=embedding_dim, eps=layer_norm_eps, device=device, dtype=dtype),
|
||||
)
|
||||
|
||||
def encode(self, text: str) -> Tensor:
|
||||
tokens = self.tokenizer(text, sequence_length=self.positional_embedding_dim).to(self.device)
|
||||
return self(tokens)
|
||||
|
||||
@property
|
||||
def unconditional_text_embedding(self) -> Tensor:
|
||||
return self.encode("")
|
||||
|
||||
|
||||
class CLIPTextEncoderL(CLIPTextEncoder):
|
||||
"""
|
||||
CLIPTextEncoderL is the CLIP text encoder with the following parameters:
|
||||
embedding_dim=768
|
||||
num_layers=12
|
||||
num_attention_heads=12
|
||||
feedforward_dim=3072
|
||||
|
||||
We replace the GeLU activation function with an approximate GeLU to comply with the original CLIP implementation
|
||||
of OpenAI (https://github.com/openai/CLIP/blob/main/clip/model.py#L166)
|
||||
"""
|
||||
|
||||
def __init__(self, device: Device | str | None = None, dtype: DType | None = None) -> None:
|
||||
super().__init__(
|
||||
embedding_dim=768,
|
||||
num_layers=12,
|
||||
num_attention_heads=12,
|
||||
feedforward_dim=3072,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
)
|
||||
for gelu, parent in self.walk(lambda m, _: isinstance(m, GeLU)):
|
||||
parent.replace(old_module=gelu, new_module=ApproximateGeLU())
|
||||
|
||||
|
||||
class CLIPTextEncoderH(CLIPTextEncoder):
|
||||
"""
|
||||
CLIPTextEncoderH is the CLIP text encoder with the following parameters:
|
||||
embedding_dim=1024
|
||||
num_layers=23
|
||||
num_attention_heads=16
|
||||
feedforward_dim=4096
|
||||
"""
|
||||
|
||||
def __init__(self, device: Device | str | None = None, dtype: DType | None = None) -> None:
|
||||
super().__init__(
|
||||
embedding_dim=1024,
|
||||
num_layers=23,
|
||||
num_attention_heads=16,
|
||||
feedforward_dim=4096,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
)
|
||||
|
||||
|
||||
class CLIPTextEncoderG(CLIPTextEncoder):
|
||||
"""
|
||||
CLIPTextEncoderG is the CLIP text encoder with the following parameters:
|
||||
embedding_dim=1280
|
||||
num_layers=32
|
||||
num_attention_heads=20
|
||||
feedforward_dim=5120
|
||||
"""
|
||||
|
||||
def __init__(self, device: Device | str | None = None, dtype: DType | None = None) -> None:
|
||||
super().__init__(
|
||||
embedding_dim=1280,
|
||||
num_layers=32,
|
||||
num_attention_heads=20,
|
||||
feedforward_dim=5120,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
)
|
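Usage sketch for the text encoder (assuming the package and its bundled BPE vocabulary are available). With freshly initialized, i.e. random, weights the values are meaningless, but the shapes illustrate the contract: one embedding per token position.

from refiners.foundationals.clip.text_encoder import CLIPTextEncoderL

encoder = CLIPTextEncoderL()
embedding = encoder.encode("a cute cat")
assert embedding.shape == (1, 77, 768)  # batch, positional_embedding_dim, embedding_dim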
108
src/refiners/foundationals/clip/tokenizer.py
Normal file
|
@ -0,0 +1,108 @@
|
|||
import gzip
|
||||
from pathlib import Path
|
||||
from functools import lru_cache
|
||||
from itertools import islice
|
||||
import re
|
||||
from torch import Tensor, tensor
|
||||
from refiners.fluxion import pad
|
||||
|
||||
|
||||
class CLIPTokenizer:
|
||||
def __init__(
|
||||
self,
|
||||
vocabulary_path: str | Path = Path(__file__).resolve().parent / "bpe_simple_vocab_16e6.txt.gz",
|
||||
):
|
||||
self.vocabulary_path = vocabulary_path
|
||||
self.byte_to_unicode_mapping = self.get_bytes_to_unicode_mapping()
|
||||
self.byte_decoder = {v: k for k, v in self.byte_to_unicode_mapping.items()}
|
||||
merge_tuples = [
|
||||
tuple(merge.split())
|
||||
for merge in gzip.open(vocabulary_path).read().decode("utf-8").split("\n")[1 : 49152 - 256 - 2 + 1]
|
||||
]
|
||||
vocabulary = (
|
||||
list(self.byte_to_unicode_mapping.values())
|
||||
+ [v + "</w>" for v in self.byte_to_unicode_mapping.values()]
|
||||
+ ["".join(merge) for merge in merge_tuples]
|
||||
+ ["", ""]
|
||||
)
|
||||
self.token_to_id_mapping = {token: i for i, token in enumerate(vocabulary)}
|
||||
self.byte_pair_encoding_ranks = {merge: i for i, merge in enumerate(merge_tuples)}
|
||||
self.byte_pair_encoding_cache = {"": ""}
|
||||
# Note: this regular expression does not support Unicode. It was changed
|
||||
# to get rid of the dependence on the `regex` module. Unicode support could
|
||||
# potentially be added back by leveraging the `\w` character class.
|
||||
self.token_pattern = re.compile(
|
||||
r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[a-zA-Z]+|[0-9]|[^\s\w]+""",
|
||||
re.IGNORECASE,
|
||||
)
|
||||
self.start_of_text_token_id: int = 49406
|
||||
self.end_of_text_token_id: int = 49407
|
||||
|
||||
def __call__(self, text: str, sequence_length: int) -> Tensor:
|
||||
tokens = self.encode(text=text, max_length=sequence_length).unsqueeze(0)
|
||||
assert (
|
||||
tokens.shape[1] <= sequence_length
|
||||
), f"Text is too long: tokens.shape[1] > sequence_length: {tokens.shape[1]} > {sequence_length}"
|
||||
return pad(tokens, (0, sequence_length - tokens.shape[1]), value=self.end_of_text_token_id)
|
||||
|
||||
@lru_cache()
|
||||
def get_bytes_to_unicode_mapping(self) -> dict[int, str]:
|
||||
initial_byte_values = (
|
||||
list(range(ord("!"), ord("~") + 1))
|
||||
+ list(range(ord("¡"), ord("¬") + 1))
|
||||
+ list(range(ord("®"), ord("ÿ") + 1))
|
||||
)
|
||||
extra_unicode_values = (byte for byte in range(2**8) if byte not in initial_byte_values)
|
||||
byte_values = initial_byte_values + list(extra_unicode_values)
|
||||
unicode_values = [chr(value) for value in byte_values]
|
||||
return dict(zip(byte_values, unicode_values))
|
||||
|
||||
def byte_pair_encoding(self, token: str) -> str:
|
||||
if token in self.byte_pair_encoding_cache:
|
||||
return self.byte_pair_encoding_cache[token]
|
||||
|
||||
def recursive_bpe(word: tuple[str, ...]) -> tuple[str, ...]:
|
||||
if len(word) < 2:
|
||||
return word
|
||||
pairs = {(i, (word[i], word[i + 1])) for i in range(len(word) - 1)}
|
||||
min_pair = min(
|
||||
pairs,
|
||||
key=lambda pair: self.byte_pair_encoding_ranks.get(pair[1], float("inf")),
|
||||
)
|
||||
if min_pair[1] not in self.byte_pair_encoding_ranks:
|
||||
return word
|
||||
new_word: list[str] = []
|
||||
i = 0
|
||||
while i < len(word):
|
||||
if i == min_pair[0]:
|
||||
new_word.append(min_pair[1][0] + min_pair[1][1])
|
||||
i += 2
|
||||
else:
|
||||
new_word.append(word[i])
|
||||
i += 1
|
||||
return recursive_bpe(tuple(new_word))
|
||||
|
||||
word = tuple(token[:-1]) + (token[-1] + "</w>",)
|
||||
result = " ".join(recursive_bpe(word))
|
||||
self.byte_pair_encoding_cache[token] = result
|
||||
return result
|
||||
|
||||
def encode(self, text: str, max_length: int | None = None) -> Tensor:
|
||||
text = re.sub(r"\s+", " ", text.lower())
|
||||
tokens = re.findall(self.token_pattern, text)
|
||||
upper_bound = None
|
||||
if max_length:
|
||||
assert max_length >= 2
|
||||
upper_bound = max_length - 2
|
||||
encoded_tokens = islice(
|
||||
(
|
||||
self.token_to_id_mapping[subtoken]
|
||||
for token in tokens
|
||||
for subtoken in self.byte_pair_encoding(
|
||||
"".join(self.byte_to_unicode_mapping[character] for character in token.encode("utf-8"))
|
||||
).split(" ")
|
||||
),
|
||||
0,
|
||||
upper_bound,
|
||||
)
|
||||
return tensor([self.start_of_text_token_id, *encoded_tokens, self.end_of_text_token_id])
|
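Tokenizer usage sketch (assuming the bundled vocabulary file is present): the callable pads the encoded sequence to the requested length with the end-of-text token.

from refiners.foundationals.clip.tokenizer import CLIPTokenizer

tokenizer = CLIPTokenizer()
tokens = tokenizer("a photograph of an astronaut riding a horse", sequence_length=77)
assert tokens.shape == (1, 77)
assert tokens[0, 0].item() == tokenizer.start_of_text_token_id  # 49406
assert tokens[0, -1].item() == tokenizer.end_of_text_token_id   # padding, 49407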
201
src/refiners/foundationals/latent_diffusion/__init__.py
Normal file
|
@ -0,0 +1,201 @@
|
|||
from typing import TypeVar
|
||||
from torch import cat, float32, randn, tensor, device as Device, dtype as DType, Size, Tensor
|
||||
from PIL import Image
|
||||
import numpy as np
|
||||
|
||||
from refiners.fluxion.utils import image_to_tensor, interpolate
|
||||
from refiners.fluxion.layers.module import Module
|
||||
from refiners.foundationals.latent_diffusion.auto_encoder import (
|
||||
LatentDiffusionAutoencoder,
|
||||
)
|
||||
from refiners.foundationals.clip.text_encoder import (
|
||||
CLIPTextEncoder,
|
||||
CLIPTextEncoderL,
|
||||
)
|
||||
from refiners.foundationals.latent_diffusion.schedulers import Scheduler, DPMSolver
|
||||
from refiners.foundationals.latent_diffusion.unet import UNet
|
||||
|
||||
|
||||
TLatentDiffusionModel = TypeVar("TLatentDiffusionModel", bound="LatentDiffusionModel")
|
||||
|
||||
__all__ = [
|
||||
"LatentDiffusionModel",
|
||||
"UNet",
|
||||
"DPMSolver",
|
||||
"Scheduler",
|
||||
"CLIPTextEncoder",
|
||||
"LatentDiffusionAutoencoder",
|
||||
]
|
||||
|
||||
|
||||
class LatentDiffusionModel(Module):
|
||||
def __init__(
|
||||
self,
|
||||
unet: UNet,
|
||||
lda: LatentDiffusionAutoencoder,
|
||||
clip_text_encoder: CLIPTextEncoder,
|
||||
scheduler: Scheduler,
|
||||
device: Device | str = "cpu",
|
||||
dtype: DType = float32,
|
||||
):
|
||||
super().__init__()
|
||||
self.device: Device = device if isinstance(device, Device) else Device(device)
|
||||
self.dtype = dtype
|
||||
self.unet = unet.to(self.device, dtype=self.dtype)
|
||||
self.lda = lda.to(self.device, dtype=self.dtype)
|
||||
self.clip_text_encoder = clip_text_encoder.to(self.device, dtype=self.dtype)
|
||||
self.scheduler = scheduler.to(self.device, dtype=self.dtype)
|
||||
|
||||
def set_num_inference_steps(self, num_inference_steps: int):
|
||||
initial_diffusion_rate = self.scheduler.initial_diffusion_rate
|
||||
final_diffusion_rate = self.scheduler.final_diffusion_rate
|
||||
device, dtype = self.scheduler.device, self.scheduler.dtype
|
||||
self.scheduler = self.scheduler.__class__(
|
||||
num_inference_steps,
|
||||
initial_diffusion_rate=initial_diffusion_rate,
|
||||
final_diffusion_rate=final_diffusion_rate,
|
||||
).to(device=device, dtype=dtype)
|
||||
|
||||
def init_latents(
|
||||
self,
|
||||
size: tuple[int, int],
|
||||
init_image: Image.Image | None = None,
|
||||
first_step: int = 0,
|
||||
noise: Tensor | None = None,
|
||||
) -> Tensor:
|
||||
if noise is None:
|
||||
height, width = size
|
||||
noise = randn(1, 4, height // 8, width // 8, device=self.device)
|
||||
assert list(noise.shape[2:]) == [
|
||||
size[0] // 8,
|
||||
size[1] // 8,
|
||||
], f"noise shape is not compatible: {noise.shape}, with size: {size}"
|
||||
if init_image is None:
|
||||
return noise
|
||||
encoded_image = self.lda.encode_image(init_image.resize(size))
|
||||
return self.scheduler.add_noise(encoded_image, noise, self.steps[first_step])
|
||||
|
||||
@property
|
||||
def steps(self) -> list[int]:
|
||||
return self.scheduler.steps
|
||||
|
||||
@property
|
||||
def timestep_embeddings(self) -> Tensor:
|
||||
return self.timestep_encoder(self.scheduler.timesteps)
|
||||
|
||||
@property
|
||||
def unconditional_clip_text_embeddings(self) -> Tensor:
|
||||
return self.clip_text_encoder.unconditional_text_embedding
|
||||
|
||||
def compute_text_embedding(self, text: str) -> Tensor:
|
||||
return self.clip_text_encoder.encode(text)
|
||||
|
||||
def forward(
|
||||
self,
|
||||
x: Tensor,
|
||||
step: int,
|
||||
clip_text_embedding: Tensor,
|
||||
negative_clip_text_embedding: Tensor | None = None,
|
||||
condition_scale: float = 7.5,
|
||||
) -> Tensor:
|
||||
timestep = self.scheduler.timesteps[step].unsqueeze(0)
|
||||
self.unet.set_timestep(timestep)
|
||||
|
||||
negative_clip_text_embedding = (
|
||||
self.clip_text_encoder.unconditional_text_embedding
|
||||
if negative_clip_text_embedding is None
|
||||
else negative_clip_text_embedding
|
||||
)
|
||||
|
||||
clip_text_embeddings = cat((negative_clip_text_embedding, clip_text_embedding))
|
||||
|
||||
self.unet.set_clip_text_embedding(clip_text_embeddings)
|
||||
latents = cat((x, x)) # for classifier-free guidance
|
||||
unconditional_prediction, conditional_prediction = self.unet(latents).chunk(2)
|
||||
|
||||
# classifier-free guidance
|
||||
noise = unconditional_prediction + condition_scale * (conditional_prediction - unconditional_prediction)
|
||||
x = x.narrow(dim=1, start=0, length=4) # support > 4 channels for inpainting
|
||||
return self.scheduler(x, noise=noise, step=step)
|
||||
|
||||
def structural_copy(self: TLatentDiffusionModel) -> TLatentDiffusionModel:
|
||||
return self.__class__(
|
||||
unet=self.unet.structural_copy(),
|
||||
lda=self.lda.structural_copy(),
|
||||
clip_text_encoder=self.clip_text_encoder.structural_copy(),
|
||||
scheduler=self.scheduler,
|
||||
device=self.device,
|
||||
dtype=self.dtype,
|
||||
)
|
||||
|
||||
|
||||
class StableDiffusion_1(LatentDiffusionModel):
|
||||
def __init__(
|
||||
self,
|
||||
unet: UNet | None = None,
|
||||
lda: LatentDiffusionAutoencoder | None = None,
|
||||
clip_text_encoder: CLIPTextEncoderL | None = None,
|
||||
scheduler: Scheduler | None = None,
|
||||
device: Device | str = "cpu",
|
||||
dtype: DType = float32,
|
||||
):
|
||||
unet = unet or UNet(in_channels=4, clip_embedding_dim=768)
|
||||
lda = lda or LatentDiffusionAutoencoder()
|
||||
clip_text_encoder = clip_text_encoder or CLIPTextEncoderL()
|
||||
scheduler = scheduler or DPMSolver(num_inference_steps=30)
|
||||
|
||||
super().__init__(
|
||||
unet,
|
||||
lda,
|
||||
clip_text_encoder=clip_text_encoder,
|
||||
scheduler=scheduler,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
)
|
||||
|
||||
|
||||
class StableDiffusion_1_Inpainting(StableDiffusion_1):
|
||||
def __init__(
|
||||
self,
|
||||
unet: UNet | None = None,
|
||||
lda: LatentDiffusionAutoencoder | None = None,
|
||||
clip_text_encoder: CLIPTextEncoderL | None = None,
|
||||
scheduler: Scheduler | None = None,
|
||||
device: Device | str = "cpu",
|
||||
dtype: DType = float32,
|
||||
):
|
||||
self.mask_latents: Tensor | None = None
|
||||
self.target_image_latents: Tensor | None = None
|
||||
super().__init__(unet, lda, clip_text_encoder, scheduler, device, dtype)
|
||||
|
||||
def forward(
|
||||
self,
|
||||
x: Tensor,
|
||||
step: int,
|
||||
clip_text_embedding: Tensor,
|
||||
negative_clip_text_embedding: Tensor | None = None,
|
||||
condition_scale: float = 7.5,
|
||||
):
|
||||
assert self.mask_latents is not None
|
||||
assert self.target_image_latents is not None
|
||||
x = cat((x, self.mask_latents, self.target_image_latents), dim=1)
|
||||
return super().forward(x, step, clip_text_embedding, negative_clip_text_embedding, condition_scale)
|
||||
|
||||
def set_inpainting_conditions(
|
||||
self,
|
||||
target_image: Image.Image,
|
||||
mask: Image.Image,
|
||||
latents_size: tuple[int, int] = (64, 64),
|
||||
) -> tuple[Tensor, Tensor]:
|
||||
target_image = target_image.convert("RGB")
|
||||
mask = mask.convert("L")
|
||||
|
||||
mask_tensor = tensor(np.array(mask).astype(np.float32) / 255.0).to(self.device)
|
||||
mask_tensor = (mask_tensor > 0.5).unsqueeze(0).unsqueeze(0).to(dtype=self.dtype)
|
||||
self.mask_latents = interpolate(mask_tensor, Size(latents_size))
|
||||
|
||||
init_image_tensor = image_to_tensor(target_image, device=self.device, dtype=self.dtype) * 2 - 1
|
||||
masked_init_image = init_image_tensor * (1 - mask_tensor)
|
||||
self.target_image_latents = self.lda.encode(masked_init_image)
|
||||
|
||||
return self.mask_latents, self.target_image_latents
|
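A minimal text-to-image sampling loop built from the pieces above, as a sketch only: in practice the UNet, autoencoder and text encoder weights must first be loaded from converted checkpoints (the loading step below is a placeholder comment), and inference would typically run on a GPU with float16.

import torch
from refiners.foundationals.latent_diffusion import StableDiffusion_1

sd = StableDiffusion_1(device="cpu")
# sd.unet.load_state_dict(...); sd.lda.load_state_dict(...); sd.clip_text_encoder.load_state_dict(...)

sd.set_num_inference_steps(30)
prompt_embedding = sd.compute_text_embedding("a cute cat")

with torch.no_grad():
    x = sd.init_latents(size=(512, 512))  # (1, 4, 64, 64) Gaussian noise
    for step in sd.steps:
        # classifier-free guidance against the unconditional embedding happens inside forward
        x = sd(x, step=step, clip_text_embedding=prompt_embedding, condition_scale=7.5)
    image = sd.lda.decode_latents(x)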
230
src/refiners/foundationals/latent_diffusion/auto_encoder.py
Normal file
|
@ -0,0 +1,230 @@
|
|||
from refiners.fluxion.context import Contexts
|
||||
from refiners.fluxion.layers import (
|
||||
Chain,
|
||||
Conv2d,
|
||||
GroupNorm,
|
||||
Identity,
|
||||
SiLU,
|
||||
Downsample,
|
||||
Upsample,
|
||||
Sum,
|
||||
SelfAttention2d,
|
||||
Slicing,
|
||||
)
|
||||
from refiners.fluxion.utils import image_to_tensor, tensor_to_image
|
||||
from torch import Tensor, device as Device, dtype as DType
|
||||
from PIL import Image
|
||||
|
||||
|
||||
class Resnet(Sum):
|
||||
structural_attrs = ["in_channels", "out_channels"]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
in_channels: int,
|
||||
out_channels: int,
|
||||
num_groups: int = 32,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
):
|
||||
self.in_channels = in_channels
|
||||
self.out_channels = out_channels
|
||||
shortcut = (
|
||||
Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=1, device=device, dtype=dtype)
|
||||
if in_channels != out_channels
|
||||
else Identity()
|
||||
)
|
||||
super().__init__(
|
||||
shortcut,
|
||||
Chain(
|
||||
GroupNorm(channels=in_channels, num_groups=num_groups, device=device, dtype=dtype),
|
||||
SiLU(),
|
||||
Conv2d(
|
||||
in_channels=in_channels,
|
||||
out_channels=out_channels,
|
||||
kernel_size=3,
|
||||
padding=1,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
GroupNorm(channels=out_channels, num_groups=num_groups, device=device, dtype=dtype),
|
||||
SiLU(),
|
||||
Conv2d(
|
||||
in_channels=out_channels,
|
||||
out_channels=out_channels,
|
||||
kernel_size=3,
|
||||
padding=1,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
),
|
||||
)
|
||||
|
||||
|
||||
class Encoder(Chain):
|
||||
def __init__(self, device: Device | str | None = None, dtype: DType | None = None) -> None:
|
||||
resnet_sizes: list[int] = [128, 256, 512, 512, 512]
|
||||
input_channels: int = 3
|
||||
latent_dim: int = 8
|
||||
resnet_layers: list[Chain] = [
|
||||
Chain(
|
||||
[
|
||||
Resnet(
|
||||
in_channels=resnet_sizes[i - 1] if i > 0 else resnet_sizes[0],
|
||||
out_channels=resnet_sizes[i],
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
Resnet(in_channels=resnet_sizes[i], out_channels=resnet_sizes[i], device=device, dtype=dtype),
|
||||
]
|
||||
)
|
||||
for i in range(len(resnet_sizes))
|
||||
]
|
||||
for _, layer in zip(range(3), resnet_layers):
|
||||
channels: int = layer[-1].out_channels # type: ignore
|
||||
layer.append(Downsample(channels=channels, scale_factor=2, device=device, dtype=dtype))
|
||||
|
||||
attention_layer = Sum(
|
||||
Identity(),
|
||||
Chain(
|
||||
GroupNorm(channels=resnet_sizes[-1], num_groups=32, eps=1e-6, device=device, dtype=dtype),
|
||||
SelfAttention2d(channels=resnet_sizes[-1], device=device, dtype=dtype),
|
||||
),
|
||||
)
|
||||
resnet_layers[-1].insert_after_type(Resnet, attention_layer)
|
||||
super().__init__(
|
||||
Conv2d(
|
||||
in_channels=input_channels,
|
||||
out_channels=resnet_sizes[0],
|
||||
kernel_size=3,
|
||||
padding=1,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
Chain(*resnet_layers),
|
||||
Chain(
|
||||
GroupNorm(channels=resnet_sizes[-1], num_groups=32, eps=1e-6, device=device, dtype=dtype),
|
||||
SiLU(),
|
||||
Conv2d(
|
||||
in_channels=resnet_sizes[-1],
|
||||
out_channels=latent_dim,
|
||||
kernel_size=3,
|
||||
padding=1,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
),
|
||||
Chain(
|
||||
Conv2d(in_channels=8, out_channels=8, kernel_size=1, device=device, dtype=dtype),
|
||||
Slicing(dim=1, start=0, length=4),
|
||||
),
|
||||
)
|
||||
|
||||
def init_context(self) -> Contexts:
|
||||
return {"sampling": {"shapes": []}}
|
||||
|
||||
|
||||
class Decoder(Chain):
|
||||
structural_attrs = ["resnet_sizes", "latent_dim", "output_channels"]
|
||||
|
||||
def __init__(self, device: Device | str | None = None, dtype: DType | None = None) -> None:
|
||||
self.resnet_sizes: list[int] = [128, 256, 512, 512, 512]
|
||||
self.latent_dim: int = 4
|
||||
self.output_channels: int = 3
|
||||
resnet_sizes = self.resnet_sizes[::-1]
|
||||
resnet_layers: list[Chain] = [
|
||||
(
|
||||
Chain(
|
||||
[
|
||||
Resnet(
|
||||
in_channels=resnet_sizes[i - 1] if i > 0 else resnet_sizes[0],
|
||||
out_channels=resnet_sizes[i],
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
Resnet(in_channels=resnet_sizes[i], out_channels=resnet_sizes[i], device=device, dtype=dtype),
|
||||
Resnet(in_channels=resnet_sizes[i], out_channels=resnet_sizes[i], device=device, dtype=dtype),
|
||||
]
|
||||
)
|
||||
if i > 0
|
||||
else Chain(
|
||||
[
|
||||
Resnet(in_channels=resnet_sizes[0], out_channels=resnet_sizes[i], device=device, dtype=dtype),
|
||||
Resnet(in_channels=resnet_sizes[i], out_channels=resnet_sizes[i], device=device, dtype=dtype),
|
||||
]
|
||||
)
|
||||
)
|
||||
for i in range(len(resnet_sizes))
|
||||
]
|
||||
attention_layer = Sum(
|
||||
Identity(),
|
||||
Chain(
|
||||
GroupNorm(channels=resnet_sizes[0], num_groups=32, eps=1e-6, device=device, dtype=dtype),
|
||||
SelfAttention2d(channels=resnet_sizes[0], device=device, dtype=dtype),
|
||||
),
|
||||
)
|
||||
resnet_layers[0].insert(1, attention_layer)
|
||||
for _, layer in zip(range(3), resnet_layers[1:]):
|
||||
channels: int = layer[-1].out_channels
|
||||
layer.insert(-1, Upsample(channels=channels, upsample_factor=2, device=device, dtype=dtype))
|
||||
super().__init__(
|
||||
Conv2d(
|
||||
in_channels=self.latent_dim, out_channels=self.latent_dim, kernel_size=1, device=device, dtype=dtype
|
||||
),
|
||||
Conv2d(
|
||||
in_channels=self.latent_dim,
|
||||
out_channels=resnet_sizes[0],
|
||||
kernel_size=3,
|
||||
padding=1,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
Chain(*resnet_layers),
|
||||
Chain(
|
||||
GroupNorm(channels=resnet_sizes[-1], num_groups=32, eps=1e-6, device=device, dtype=dtype),
|
||||
SiLU(),
|
||||
Conv2d(
|
||||
in_channels=resnet_sizes[-1],
|
||||
out_channels=self.output_channels,
|
||||
kernel_size=3,
|
||||
padding=1,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
),
|
||||
)
|
||||
|
||||
|
||||
class LatentDiffusionAutoencoder(Chain):
|
||||
structural_attrs = ["encoder_scale"]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
) -> None:
|
||||
self.encoder_scale: float = 0.18215
|
||||
super().__init__(
|
||||
Encoder(device=device, dtype=dtype),
|
||||
Decoder(device=device, dtype=dtype),
|
||||
)
|
||||
|
||||
def encode(self, x: Tensor) -> Tensor:
|
||||
encoder = self[0]
|
||||
x = self.encoder_scale * encoder(x)
|
||||
return x
|
||||
|
||||
def decode(self, x: Tensor) -> Tensor:
|
||||
decoder = self[1]
|
||||
x = decoder(x / self.encoder_scale)
|
||||
return x
|
||||
|
||||
def encode_image(self, image: Image.Image) -> Tensor:
|
||||
x = image_to_tensor(image, device=self.device, dtype=self.dtype)
|
||||
x = 2 * x - 1
|
||||
return self.encode(x)
|
||||
|
||||
def decode_latents(self, x: Tensor) -> Image.Image:
|
||||
x = self.decode(x)
|
||||
x = (x + 1) / 2
|
||||
return tensor_to_image(x)
|
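Shape sketch for the autoencoder: encode_image maps a 512x512 RGB image to a (1, 4, 64, 64) latent (8x spatial compression into 4 channels) and decode_latents maps a latent back to a PIL image. With random weights the round trip is visually meaningless; converted Stable Diffusion VAE weights would be loaded first in practice.

from PIL import Image
from refiners.foundationals.latent_diffusion.auto_encoder import LatentDiffusionAutoencoder

lda = LatentDiffusionAutoencoder()
latents = lda.encode_image(Image.new("RGB", (512, 512)))
assert latents.shape == (1, 4, 512 // 8, 512 // 8)
image = lda.decode_latents(latents)
assert image.size == (512, 512)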
150
src/refiners/foundationals/latent_diffusion/controlnet.py
Normal file
|
@ -0,0 +1,150 @@
|
|||
from refiners.fluxion.context import Contexts
|
||||
from refiners.fluxion.layers import Chain, Conv2d, SiLU, Lambda, Passthrough, UseContext, Sum, Identity
|
||||
from refiners.foundationals.latent_diffusion.unet import DownBlocks, MiddleBlock, ResidualBlock, TimestepEncoder
|
||||
from refiners.adapters.range_adapter import RangeAdapter2d
|
||||
from typing import cast, Iterable
|
||||
from torch import Tensor, device as Device, dtype as DType
|
||||
|
||||
|
||||
class ConditionEncoder(Chain):
|
||||
"""Encode an image to be used as a condition for Controlnet.
|
||||
|
||||
Input is a `batch 3 width height` tensor, output is a `batch 320 width//8 height//8` tensor.
|
||||
"""
|
||||
|
||||
structural_attrs = ["out_channels"]
|
||||
|
||||
def __init__(self, device: Device | str | None = None, dtype: DType | None = None) -> None:
|
||||
self.out_channels = (16, 32, 96, 256)
|
||||
super().__init__(
|
||||
Chain(
|
||||
Conv2d(
|
||||
in_channels=3,
|
||||
out_channels=self.out_channels[0],
|
||||
kernel_size=3,
|
||||
stride=1,
|
||||
padding=1,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
SiLU(),
|
||||
),
|
||||
*(
|
||||
Chain(
|
||||
Conv2d(
|
||||
in_channels=self.out_channels[i],
|
||||
out_channels=self.out_channels[i],
|
||||
kernel_size=3,
|
||||
padding=1,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
SiLU(),
|
||||
Conv2d(
|
||||
in_channels=self.out_channels[i],
|
||||
out_channels=self.out_channels[i + 1],
|
||||
kernel_size=3,
|
||||
stride=2,
|
||||
padding=1,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
SiLU(),
|
||||
)
|
||||
for i in range(len(self.out_channels) - 1)
|
||||
),
|
||||
Conv2d(
|
||||
in_channels=self.out_channels[-1],
|
||||
out_channels=320,
|
||||
kernel_size=3,
|
||||
padding=1,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
)
|
||||
|
||||
|
||||
class Controlnet(Passthrough):
|
||||
structural_attrs = ["name", "scale"]
|
||||
|
||||
def __init__(self, name: str, device: Device | str | None = None, dtype: DType | None = None) -> None:
|
||||
"""Controlnet is a Half-UNet that collects residuals from the UNet and uses them to condition the UNet.
|
||||
|
||||
Input is a `batch 3 width height` tensor, output is a `batch 1280 width//8 height//8` tensor with residuals
|
||||
stored in the context.
|
||||
|
||||
It has to use the same context as the UNet: `unet` and `sampling`.
|
||||
"""
|
||||
self.name = name
|
||||
self.scale: float = 1.0
|
||||
super().__init__(
|
||||
TimestepEncoder(context_key=f"timestep_embedding_{name}", device=device, dtype=dtype),
|
||||
Lambda(lambda x: x.narrow(dim=1, start=0, length=4)), # support inpainting
|
||||
DownBlocks(in_channels=4, device=device, dtype=dtype),
|
||||
MiddleBlock(device=device, dtype=dtype),
|
||||
)
|
||||
|
||||
# We run the condition encoder at each step. Caching the result
|
||||
# is not worth it as subsequent runs take virtually no time (FG-374).
|
||||
self.DownBlocks[0].append(
|
||||
Sum(
|
||||
Identity(),
|
||||
Chain(
|
||||
UseContext("controlnet", f"condition_{name}"),
|
||||
ConditionEncoder(device=device, dtype=dtype),
|
||||
),
|
||||
),
|
||||
)
|
||||
for residual_block in self.layers(ResidualBlock):
|
||||
chain = residual_block.Chain
|
||||
range_adapter = RangeAdapter2d(
|
||||
target=chain.Conv2d_1,
|
||||
channels=residual_block.out_channels,
|
||||
embedding_dim=1280,
|
||||
context_key=f"timestep_embedding_{self.name}",
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
)
|
||||
range_adapter.inject(chain)
|
||||
for n, block in enumerate(cast(Iterable[Chain], self.DownBlocks)):
|
||||
assert hasattr(block[0], "out_channels"), (
|
||||
"The first block of every subchain in DownBlocks is expected to respond to `out_channels`,"
|
||||
f" {block[0]} does not."
|
||||
)
|
||||
out_channels: int = block[0].out_channels
|
||||
block.append(
|
||||
Passthrough(
|
||||
Conv2d(
|
||||
in_channels=out_channels, out_channels=out_channels, kernel_size=1, device=device, dtype=dtype
|
||||
),
|
||||
Lambda(self._store_nth_residual(n)),
|
||||
)
|
||||
)
|
||||
self.MiddleBlock.append(
|
||||
Passthrough(
|
||||
Conv2d(in_channels=1280, out_channels=1280, kernel_size=1, device=device, dtype=dtype),
|
||||
Lambda(self._store_nth_residual(12)),
|
||||
)
|
||||
)
|
||||
|
||||
def init_context(self) -> Contexts:
|
||||
return {
|
||||
"unet": {"residuals": [0.0] * 13},
|
||||
"sampling": {"shapes": []},
|
||||
"controlnet": {f"condition_{self.name}": None},
|
||||
"range_adapter": {f"timestep_embedding_{self.name}": None},
|
||||
}
|
||||
|
||||
def _store_nth_residual(self, n: int):
|
||||
def _store_residual(x: Tensor):
|
||||
residuals = self.use_context("unet")["residuals"]
|
||||
residuals[n] = residuals[n] + x * self.scale
|
||||
return x
|
||||
|
||||
return _store_residual
|
||||
|
||||
def set_controlnet_condition(self, condition: Tensor) -> None:
|
||||
self.set_context("controlnet", {f"condition_{self.name}": condition})
|
||||
|
||||
def set_scale(self, scale: float) -> None:
|
||||
self.scale = scale
|
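Sketch of driving a Controlnet at inference time. It is a Passthrough, so it leaves its input untouched and only stores residuals in the shared `unet` context; it is meant to be inserted into the UNet chain (the UNet itself is defined elsewhere). The name "canny" and the zero condition tensor below are placeholders; a real condition would be, for example, an edge map encoded as a (1, 3, H, W) tensor in [0, 1].

import torch
from refiners.foundationals.latent_diffusion.controlnet import Controlnet

controlnet = Controlnet(name="canny")
condition = torch.zeros(1, 3, 512, 512)  # placeholder condition image tensor
controlnet.set_controlnet_condition(condition)
controlnet.set_scale(0.5)  # weight of the residuals added to the UNet's "residuals" context
# controlnet would then be inserted into the UNet chain, e.g. sd.unet.insert(0, controlnet)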
203
src/refiners/foundationals/latent_diffusion/cross_attention.py
Normal file
|
@ -0,0 +1,203 @@
|
|||
from torch import Tensor, Size, device as Device, dtype as DType
|
||||
|
||||
from refiners.fluxion.context import Contexts
|
||||
from refiners.fluxion.layers import (
|
||||
Identity,
|
||||
Flatten,
|
||||
Unflatten,
|
||||
Transpose,
|
||||
Chain,
|
||||
Parallel,
|
||||
LayerNorm,
|
||||
Attention,
|
||||
Sum,
|
||||
UseContext,
|
||||
Linear,
|
||||
GLU,
|
||||
GeLU,
|
||||
GroupNorm,
|
||||
Conv2d,
|
||||
SelfAttention,
|
||||
SetContext,
|
||||
)
|
||||
|
||||
|
||||
class CrossAttentionBlock(Chain):
|
||||
structural_attrs = ["embedding_dim", "context_embedding_dim", "context", "context_key", "num_heads", "use_bias"]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
embedding_dim: int,
|
||||
context_embedding_dim: int,
|
||||
context_key: str,
|
||||
num_heads: int = 1,
|
||||
use_bias: bool = True,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
) -> None:
|
||||
self.embedding_dim = embedding_dim
|
||||
self.context_embedding_dim = context_embedding_dim
|
||||
self.context = "cross_attention_block"
|
||||
self.context_key = context_key
|
||||
self.num_heads = num_heads
|
||||
self.use_bias = use_bias
|
||||
|
||||
super().__init__(
|
||||
Sum(
|
||||
Identity(),
|
||||
Chain(
|
||||
LayerNorm(normalized_shape=embedding_dim, device=device, dtype=dtype),
|
||||
SelfAttention(
|
||||
embedding_dim=embedding_dim, num_heads=num_heads, use_bias=use_bias, device=device, dtype=dtype
|
||||
),
|
||||
),
|
||||
),
|
||||
Sum(
|
||||
Identity(),
|
||||
Chain(
|
||||
LayerNorm(normalized_shape=embedding_dim, device=device, dtype=dtype),
|
||||
Parallel(
|
||||
Identity(),
|
||||
UseContext(context=self.context, key=context_key),
|
||||
UseContext(context=self.context, key=context_key),
|
||||
),
|
||||
Attention(
|
||||
embedding_dim=embedding_dim,
|
||||
num_heads=num_heads,
|
||||
key_embedding_dim=context_embedding_dim,
|
||||
value_embedding_dim=context_embedding_dim,
|
||||
use_bias=use_bias,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
),
|
||||
),
|
||||
Sum(
|
||||
Identity(),
|
||||
Chain(
|
||||
LayerNorm(normalized_shape=embedding_dim, device=device, dtype=dtype),
|
||||
Linear(in_features=embedding_dim, out_features=2 * 4 * embedding_dim, device=device, dtype=dtype),
|
||||
GLU(GeLU()),
|
||||
Linear(in_features=4 * embedding_dim, out_features=embedding_dim, device=device, dtype=dtype),
|
||||
),
|
||||
),
|
||||
)
|
||||
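The last residual branch above is a GEGLU-style feed-forward: the first Linear doubles the hidden width to 2 * 4 * embedding_dim, GLU(GeLU()) splits it in two and gates one half with the GeLU of the other, and the second Linear projects back to embedding_dim. A plain PyTorch sketch of that gating, assuming this split convention (the actual GLU layer is defined elsewhere in fluxion):

import torch

def geglu(x: torch.Tensor) -> torch.Tensor:
    value, gate = x.chunk(2, dim=-1)  # split the doubled hidden width in two
    return value * torch.nn.functional.gelu(gate)

hidden = torch.randn(1, 77, 2 * 4 * 320)  # e.g. embedding_dim = 320
assert geglu(hidden).shape == (1, 77, 4 * 320)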
|
||||
|
||||
class StatefulFlatten(Chain):
|
||||
structural_attrs = ["start_dim", "end_dim"]
|
||||
|
||||
def __init__(self, context: str, key: str, start_dim: int = 0, end_dim: int = -1) -> None:
|
||||
self.start_dim = start_dim
|
||||
self.end_dim = end_dim
|
||||
|
||||
super().__init__(
|
||||
SetContext(context=context, key=key, callback=self.push),
|
||||
Flatten(start_dim=start_dim, end_dim=end_dim),
|
||||
)
|
||||
|
||||
def push(self, sizes: list[Size], x: Tensor) -> None:
|
||||
sizes.append(
|
||||
x.shape[slice(self.start_dim, self.end_dim + 1 if self.end_dim >= 0 else x.ndim + self.end_dim + 1)]
|
||||
)
|
||||
|
||||
|
||||
class CrossAttentionBlock2d(Sum):
|
||||
structural_attrs = [
|
||||
"channels",
|
||||
"in_channels",
|
||||
"out_channels",
|
||||
"context_embedding_dim",
|
||||
"num_attention_heads",
|
||||
"num_attention_layers",
|
||||
"num_groups",
|
||||
"context_key",
|
||||
"use_linear_projection",
|
||||
"projection_type",
|
||||
]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
channels: int,
|
||||
context_embedding_dim: int,
|
||||
context_key: str,
|
||||
num_attention_heads: int = 1,
|
||||
num_attention_layers: int = 1,
|
||||
num_groups: int = 32,
|
||||
use_bias: bool = True,
|
||||
use_linear_projection: bool = False,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
) -> None:
|
||||
assert channels % num_attention_heads == 0, "in_channels must be divisible by num_attention_heads"
|
||||
self.channels = channels
|
||||
self.in_channels = channels
|
||||
self.out_channels = channels
|
||||
self.context_embedding_dim = context_embedding_dim
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.num_attention_layers = num_attention_layers
|
||||
self.num_groups = num_groups
|
||||
self.context_key = context_key
|
||||
self.use_linear_projection = use_linear_projection
|
||||
self.projection_type = "Linear" if use_linear_projection else "Conv2d"
|
||||
|
||||
in_block = (
|
||||
Chain(
|
||||
GroupNorm(channels=channels, num_groups=num_groups, eps=1e-6, affine=True, device=device, dtype=dtype),
|
||||
StatefulFlatten(context="flatten", key="sizes", start_dim=2),
|
||||
Transpose(1, 2),
|
||||
Linear(in_features=channels, out_features=channels, device=device, dtype=dtype),
|
||||
)
|
||||
if use_linear_projection
|
||||
else Chain(
|
||||
GroupNorm(channels=channels, num_groups=num_groups, eps=1e-6, affine=True, device=device, dtype=dtype),
|
||||
Conv2d(in_channels=channels, out_channels=channels, kernel_size=1, device=device, dtype=dtype),
|
||||
StatefulFlatten(context="flatten", key="sizes", start_dim=2),
|
||||
Transpose(1, 2),
|
||||
)
|
||||
)
|
||||
|
||||
out_block = (
|
||||
Chain(
|
||||
Linear(in_features=channels, out_features=channels, device=device, dtype=dtype),
|
||||
Transpose(1, 2),
|
||||
Parallel(
|
||||
Identity(),
|
||||
UseContext(context="flatten", key="sizes").compose(lambda x: x.pop()),
|
||||
),
|
||||
Unflatten(dim=2),
|
||||
)
|
||||
if use_linear_projection
|
||||
else Chain(
|
||||
Transpose(1, 2),
|
||||
Parallel(
|
||||
Identity(),
|
||||
UseContext(context="flatten", key="sizes").compose(lambda x: x.pop()),
|
||||
),
|
||||
Unflatten(dim=2),
|
||||
Conv2d(in_channels=channels, out_channels=channels, kernel_size=1, device=device, dtype=dtype),
|
||||
)
|
||||
)
|
||||
|
||||
super().__init__(
|
||||
Identity(),
|
||||
Chain(
|
||||
in_block,
|
||||
Chain(
|
||||
CrossAttentionBlock(
|
||||
embedding_dim=channels,
|
||||
context_embedding_dim=context_embedding_dim,
|
||||
context_key=context_key,
|
||||
num_heads=num_attention_heads,
|
||||
use_bias=use_bias,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
)
|
||||
for _ in range(num_attention_layers)
|
||||
),
|
||||
out_block,
|
||||
),
|
||||
)
|
||||
|
||||
def init_context(self) -> Contexts:
|
||||
return {"flatten": {"sizes": []}}
|
101
src/refiners/foundationals/latent_diffusion/lora.py
Normal file
|
@ -0,0 +1,101 @@
|
|||
from enum import Enum
|
||||
from pathlib import Path
|
||||
|
||||
from torch import Tensor, device as Device
|
||||
from torch.nn import Parameter as TorchParameter
|
||||
from refiners.adapters.lora import LoraAdapter, load_lora_weights
|
||||
from refiners.foundationals.clip.text_encoder import FeedForward, TransformerLayer
|
||||
from refiners.foundationals.latent_diffusion.cross_attention import CrossAttentionBlock2d
|
||||
from refiners.foundationals.latent_diffusion import StableDiffusion_1
|
||||
from refiners.foundationals.latent_diffusion.controlnet import Controlnet
|
||||
import refiners.fluxion.layers as fl
|
||||
from refiners.fluxion.utils import load_from_safetensors, load_metadata_from_safetensors
|
||||
|
||||
|
||||
class LoraTarget(str, Enum):
|
||||
Self = "self"
|
||||
Attention = "Attention"
|
||||
SelfAttention = "SelfAttention"
|
||||
CrossAttention = "CrossAttentionBlock2d"
|
||||
FeedForward = "FeedForward"
|
||||
TransformerLayer = "TransformerLayer"
|
||||
|
||||
def get_class(self) -> type[fl.Chain]:
|
||||
match self:
|
||||
case LoraTarget.Self:
|
||||
return fl.Chain
|
||||
case LoraTarget.Attention:
|
||||
return fl.Attention
|
||||
case LoraTarget.SelfAttention:
|
||||
return fl.SelfAttention
|
||||
case LoraTarget.CrossAttention:
|
||||
return CrossAttentionBlock2d
|
||||
case LoraTarget.FeedForward:
|
||||
return FeedForward
|
||||
case LoraTarget.TransformerLayer:
|
||||
return TransformerLayer
|
||||
|
||||
|
||||
def get_lora_rank(weights: list[Tensor]) -> int:
|
||||
ranks: set[int] = {w.shape[1] for w in weights[0::2]}
|
||||
assert len(ranks) == 1
|
||||
return ranks.pop()
|
||||
|
||||
|
||||
def apply_loras_to_target(module: fl.Chain, target: LoraTarget, rank: int, scale: float) -> None:
|
||||
for layer in module.layers(layer_type=target.get_class()):
|
||||
for linear, parent in layer.walk(fl.Linear):
|
||||
adapter = LoraAdapter(
|
||||
target=linear,
|
||||
rank=rank,
|
||||
scale=scale,
|
||||
device=module.device,
|
||||
dtype=module.dtype,
|
||||
)
|
||||
adapter.inject(parent)
|
||||
|
||||
|
||||
class LoraWeights:
|
||||
"""A single LoRA weights training checkpoint used to patch a Stable Diffusion 1.5 model."""
|
||||
|
||||
metadata: dict[str, str] | None
|
||||
tensors: dict[str, Tensor]
|
||||
|
||||
def __init__(self, checkpoint_path: Path | str, device: Device | str):
|
||||
self.metadata = load_metadata_from_safetensors(checkpoint_path)
|
||||
self.tensors = load_from_safetensors(checkpoint_path, device=device)
|
||||
|
||||
def patch(self, sd: StableDiffusion_1, scale: float = 1.0) -> None:
|
||||
assert self.metadata is not None, "Invalid safetensors checkpoint: missing metadata"
|
||||
|
||||
for meta_key, meta_value in self.metadata.items():
|
||||
match meta_key:
|
||||
case "unet_targets":
|
||||
# TODO: support this transparently
|
||||
if any([isinstance(module, Controlnet) for module in sd.unet]):
|
||||
raise NotImplementedError("Cannot patch a UNet which already contains a Controlnet adapter")
|
||||
model = sd.unet
|
||||
key_prefix = "unet."
|
||||
case "text_encoder_targets":
|
||||
model = sd.clip_text_encoder
|
||||
key_prefix = "text_encoder."
|
||||
case "lda_targets":
|
||||
model = sd.lda
|
||||
key_prefix = "lda."
|
||||
case _:
|
||||
raise ValueError(f"Unexpected key in checkpoint metadata: {meta_key}")
|
||||
|
||||
# TODO(FG-487): support loading multiple LoRA-s
|
||||
if any(model.layers(LoraAdapter)):
|
||||
raise NotImplementedError(f"{model.__class__.__name__} already contains LoRA layers")
|
||||
|
||||
lora_weights = [w for w in [self.tensors[k] for k in sorted(self.tensors) if k.startswith(key_prefix)]]
|
||||
assert len(lora_weights) % 2 == 0
|
||||
|
||||
rank = get_lora_rank(lora_weights)
|
||||
for target in meta_value.split(","):
|
||||
apply_loras_to_target(model, target=LoraTarget(target), rank=rank, scale=scale)
|
||||
|
||||
assert len(list(model.layers(LoraAdapter))) == (len(lora_weights) // 2)
|
||||
|
||||
load_lora_weights(model, [TorchParameter(w) for w in lora_weights])
|
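Usage sketch: patching a StableDiffusion_1 model with a LoRA checkpoint saved with the metadata layout handled above (unet_targets, text_encoder_targets, lda_targets). The checkpoint file name is hypothetical.

from refiners.foundationals.latent_diffusion import StableDiffusion_1
from refiners.foundationals.latent_diffusion.lora import LoraWeights

sd = StableDiffusion_1(device="cpu")
lora = LoraWeights("pokemon_lora.safetensors", device="cpu")  # hypothetical checkpoint path
lora.patch(sd, scale=1.0)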
|
@ -0,0 +1,11 @@
|
|||
from refiners.foundationals.latent_diffusion.schedulers.scheduler import Scheduler
|
||||
from refiners.foundationals.latent_diffusion.schedulers.dpm_solver import DPMSolver
|
||||
from refiners.foundationals.latent_diffusion.schedulers.ddpm import DDPM
|
||||
from refiners.foundationals.latent_diffusion.schedulers.ddim import DDIM
|
||||
|
||||
__all__ = [
|
||||
"Scheduler",
|
||||
"DPMSolver",
|
||||
"DDPM",
|
||||
"DDIM",
|
||||
]
|
|
@ -0,0 +1,41 @@
|
|||
from torch import Tensor, device as Device, arange, sqrt
|
||||
from refiners.foundationals.latent_diffusion.schedulers.scheduler import Scheduler
|
||||
|
||||
|
||||
class DDIM(Scheduler):
|
||||
def __init__(
|
||||
self,
|
||||
num_inference_steps: int,
|
||||
num_train_timesteps: int = 1_000,
|
||||
initial_diffusion_rate: float = 8.5e-4,
|
||||
final_diffusion_rate: float = 1.2e-2,
|
||||
device: Device | str = "cpu",
|
||||
) -> None:
|
||||
super().__init__(num_inference_steps, num_train_timesteps, initial_diffusion_rate, final_diffusion_rate, device)
|
||||
self.timesteps = self._generate_timesteps()
|
||||
|
||||
def _generate_timesteps(self) -> Tensor:
|
||||
"""
|
||||
Generates decreasing timesteps with 'leading' spacing and an offset of 1,
|
||||
matching the diffusers defaults for the DDIM scheduler in Stable Diffusion 1.5.
|
||||
"""
|
||||
step_ratio = self.num_train_timesteps // self.num_inference_steps
|
||||
timesteps = arange(start=0, end=self.num_inference_steps, step=1) * step_ratio + 1
|
||||
return timesteps.flip(0)
|
||||
|
||||
def __call__(self, x: Tensor, noise: Tensor, step: int) -> Tensor:
|
||||
timestep, previous_timestep = (
|
||||
self.timesteps[step],
|
||||
self.timesteps[step] - self.num_train_timesteps // self.num_inference_steps,
|
||||
)
|
||||
current_scale_factor, previous_scale_factor = self.cumulative_scale_factors[timestep], (
|
||||
self.cumulative_scale_factors[previous_timestep]
|
||||
if previous_timestep > 0
|
||||
else self.cumulative_scale_factors[0]
|
||||
)
|
||||
predicted_x = (x - sqrt(1 - current_scale_factor**2) * noise) / current_scale_factor
|
||||
denoised_x = previous_scale_factor * predicted_x + sqrt(1 - previous_scale_factor**2) * noise
|
||||
|
||||
self.previous_scale_factor = previous_scale_factor
|
||||
|
||||
return denoised_x
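
To make the schedule above concrete: with the default 1,000 training timesteps and, say, 50 inference steps, the step ratio is 20 and the generated timesteps are [981, 961, ..., 21, 1]. A minimal denoising-loop sketch, with a random tensor standing in for the UNet's noise prediction:

```python
import torch

ddim = DDIM(num_inference_steps=50)
print(ddim.timesteps[:3])  # tensor([981, 961, 941])

x = torch.randn(1, 4, 64, 64)  # latents
for step in ddim.steps:
    predicted_noise = torch.randn_like(x)  # placeholder for a real noise prediction
    x = ddim(x, noise=predicted_noise, step=step)
```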
|
|
@ -0,0 +1,75 @@
|
|||
from torch import Tensor, device as Device, randn, arange, Generator, tensor
|
||||
from refiners.foundationals.latent_diffusion.schedulers.scheduler import Scheduler
|
||||
|
||||
|
||||
class DDPM(Scheduler):
|
||||
"""
|
||||
Implements the Denoising Diffusion Probabilistic Models (DDPM) sampler: timesteps are evenly strided over
|
||||
the training schedule, and each denoising step mixes an estimate of the clean data with the current sample, adding fresh noise except at the final step.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
num_inference_steps: int,
|
||||
num_train_timesteps: int = 1_000,
|
||||
initial_diffusion_rate: float = 8.5e-4,
|
||||
final_diffusion_rate: float = 1.2e-2,
|
||||
device: Device | str = "cpu",
|
||||
) -> None:
|
||||
super().__init__(num_inference_steps, num_train_timesteps, initial_diffusion_rate, final_diffusion_rate, device)
|
||||
|
||||
def _generate_timesteps(self) -> Tensor:
|
||||
step_ratio = self.num_train_timesteps // self.num_inference_steps
|
||||
timesteps = arange(start=0, end=self.num_inference_steps, step=1) * step_ratio
|
||||
return timesteps.flip(0)
|
||||
|
||||
def __call__(self, x: Tensor, noise: Tensor, step: int, generator: Generator | None = None) -> Tensor:
|
||||
"""
|
||||
Generate the next step in the diffusion process.
|
||||
|
||||
This method adjusts the input data using added noise and an estimate of the denoised data, based on the current
|
||||
step in the diffusion process. This adjusted data forms the next step in the diffusion process.
|
||||
|
||||
1. It uses current and previous timesteps to calculate the current factor dictating the contribution of original
|
||||
data and noise to the new step.
|
||||
2. An estimate of the denoised data (`estimated_denoised_data`) is generated.
|
||||
3. It calculates coefficients for the estimated denoised data and current data (`original_data_coeff` and
|
||||
`current_data_coeff`) that balance their contribution to the denoised data for the next step.
|
||||
4. It calculates the denoised data for the next step (`denoised_x`), which is a combination of the estimated
|
||||
denoised data and current data, adjusted by their respective coefficients.
|
||||
5. Noise is then added to `denoised_x`. The magnitude of noise is controlled by a calculated variance based on
|
||||
the cumulative scaling factor and the current factor.
|
||||
|
||||
The output is the new data step for the next stage in the diffusion process.
|
||||
"""
|
||||
timestep, previous_timestep = (
|
||||
self.timesteps[step],
|
||||
(
|
||||
self.timesteps[step + 1]
|
||||
if step < len(self.timesteps) - 1
|
||||
else tensor(-(self.num_train_timesteps // self.num_inference_steps), device=self.device)
|
||||
),
|
||||
)
|
||||
current_cumulative_factor, previous_cumulative_scale_factor = (self.scale_factors.cumprod(0))[timestep], (
|
||||
(self.scale_factors.cumprod(0))[previous_timestep]
|
||||
if step < len(self.timesteps) - 1
|
||||
else tensor(1, device=self.device)
|
||||
)
|
||||
current_factor = current_cumulative_factor / previous_cumulative_scale_factor
|
||||
estimated_denoised_data = (
|
||||
x - (1 - current_cumulative_factor) ** 0.5 * noise
|
||||
) / current_cumulative_factor**0.5
|
||||
estimated_denoised_data = estimated_denoised_data.clamp(-1, 1)
|
||||
original_data_coeff = (previous_cumulative_scale_factor**0.5 * (1 - current_factor)) / (
|
||||
1 - current_cumulative_factor
|
||||
)
|
||||
current_data_coeff = (
|
||||
current_factor**0.5 * (1 - previous_cumulative_scale_factor) / (1 - current_cumulative_factor)
|
||||
)
|
||||
denoised_x = original_data_coeff * estimated_denoised_data + current_data_coeff * x
|
||||
if step < len(self.timesteps) - 1:
|
||||
variance = (1 - previous_cumulative_scale_factor) / (1 - current_cumulative_factor) * (1 - current_factor)
|
||||
denoised_x = denoised_x + (variance.clamp(min=1e-20) ** 0.5) * randn(
|
||||
x.shape, device=x.device, dtype=x.dtype, generator=generator
|
||||
)
|
||||
return denoised_x
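
DDPM is typically the training-time scheduler: `add_noise` (defined on the base `Scheduler` below) produces the noisy sample the denoiser learns to invert. A minimal sketch, with random tensors standing in for real latents:

```python
import torch

ddpm = DDPM(num_inference_steps=1000)

clean_latents = torch.randn(1, 4, 64, 64)
noise = torch.randn_like(clean_latents)
step = 42  # index into ddpm.timesteps

# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
noisy_latents = ddpm.add_noise(clean_latents, noise=noise, step=step)
# The training objective is then to predict `noise` from `noisy_latents` at this timestep.
```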
|
|
@ -0,0 +1,111 @@
|
|||
from refiners.foundationals.latent_diffusion.schedulers.scheduler import Scheduler
|
||||
import numpy as np
|
||||
from torch import Tensor, device as Device, tensor, exp
|
||||
from collections import deque
|
||||
|
||||
|
||||
class DPMSolver(Scheduler):
|
||||
"""Implements DPM-Solver++ from https://arxiv.org/abs/2211.01095
|
||||
|
||||
We only support noise prediction for now.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
num_inference_steps: int,
|
||||
num_train_timesteps: int = 1_000,
|
||||
initial_diffusion_rate: float = 8.5e-4,
|
||||
final_diffusion_rate: float = 1.2e-2,
|
||||
device: Device | str = "cpu",
|
||||
):
|
||||
super().__init__(
|
||||
num_inference_steps=num_inference_steps,
|
||||
num_train_timesteps=num_train_timesteps,
|
||||
initial_diffusion_rate=initial_diffusion_rate,
|
||||
final_diffusion_rate=final_diffusion_rate,
|
||||
device=device,
|
||||
)
|
||||
self.estimated_data = deque([tensor([])] * 2, maxlen=2)
|
||||
self.initial_steps = 0
|
||||
|
||||
def _generate_timesteps(self) -> Tensor:
|
||||
# We need to use numpy here because:
|
||||
# numpy.linspace(0,999,31)[15] is 499.49999999999994
|
||||
# torch.linspace(0,999,31)[15] is 499.5
|
||||
# ...and we want the same result as the original codebase.
|
||||
return tensor(
|
||||
np.linspace(0, self.num_train_timesteps - 1, self.num_inference_steps + 1).round().astype(int)[1:],
|
||||
device=self.device,
|
||||
).flip(0)
|
||||
|
||||
def dpm_solver_first_order_update(self, x: Tensor, noise: Tensor, step: int) -> Tensor:
|
||||
timestep, previous_timestep = (
|
||||
self.timesteps[step],
|
||||
self.timesteps[step + 1 if step < len(self.timesteps) - 1 else 0],
|
||||
)
|
||||
previous_ratio, current_ratio = (
|
||||
self.signal_to_noise_ratios[previous_timestep],
|
||||
self.signal_to_noise_ratios[timestep],
|
||||
)
|
||||
previous_scale_factor = self.cumulative_scale_factors[previous_timestep]
|
||||
previous_noise_std, current_noise_std = (
|
||||
self.noise_std[previous_timestep],
|
||||
self.noise_std[timestep],
|
||||
)
|
||||
exp_factor = exp(-(previous_ratio - current_ratio))
|
||||
denoised_x = (previous_noise_std / current_noise_std) * x - (previous_scale_factor * (exp_factor - 1.0)) * noise
|
||||
return denoised_x
|
||||
|
||||
def multistep_dpm_solver_second_order_update(self, x: Tensor, step: int) -> Tensor:
|
||||
previous_timestep, current_timestep, next_timestep = (
|
||||
self.timesteps[step + 1] if step < len(self.timesteps) - 1 else tensor([0]),
|
||||
self.timesteps[step],
|
||||
self.timesteps[step - 1],
|
||||
)
|
||||
current_data_estimation, next_data_estimation = self.estimated_data[-1], self.estimated_data[-2]
|
||||
previous_ratio, current_ratio, next_ratio = (
|
||||
self.signal_to_noise_ratios[previous_timestep],
|
||||
self.signal_to_noise_ratios[current_timestep],
|
||||
self.signal_to_noise_ratios[next_timestep],
|
||||
)
|
||||
previous_scale_factor = self.cumulative_scale_factors[previous_timestep]
|
||||
previous_std, current_std = (
|
||||
self.noise_std[previous_timestep],
|
||||
self.noise_std[current_timestep],
|
||||
)
|
||||
estimation_delta = (current_data_estimation - next_data_estimation) / (
|
||||
(current_ratio - next_ratio) / (previous_ratio - current_ratio)
|
||||
)
|
||||
exp_neg_factor = exp(-(previous_ratio - current_ratio))
|
||||
x_t = (
|
||||
(previous_std / current_std) * x
|
||||
- (previous_scale_factor * (exp_neg_factor - 1.0)) * current_data_estimation
|
||||
- 0.5 * (previous_scale_factor * (exp_neg_factor - 1.0)) * estimation_delta
|
||||
)
|
||||
return x_t
|
||||
|
||||
def __call__(
|
||||
self,
|
||||
x: Tensor,
|
||||
noise: Tensor,
|
||||
step: int,
|
||||
) -> Tensor:
|
||||
"""
|
||||
Represents one step of the backward diffusion process that iteratively denoises the input data `x`.
|
||||
|
||||
This method works by estimating the denoised version of `x` and applying either a first-order or second-order
|
||||
backward Euler update, which is a numerical method commonly used to solve ordinary differential equations
|
||||
(ODEs).
|
||||
"""
|
||||
current_timestep = self.timesteps[step]
|
||||
scale_factor, noise_ratio = self.cumulative_scale_factors[current_timestep], self.noise_std[current_timestep]
|
||||
estimated_denoised_data = (x - noise_ratio * noise) / scale_factor
|
||||
self.estimated_data.append(estimated_denoised_data)
|
||||
denoised_x = (
|
||||
self.dpm_solver_first_order_update(x=x, noise=estimated_denoised_data, step=step)
|
||||
if (self.initial_steps == 0)
|
||||
else self.multistep_dpm_solver_second_order_update(x=x, step=step)
|
||||
)
|
||||
if self.initial_steps < 2:
|
||||
self.initial_steps += 1
|
||||
return denoised_x
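
Because of the `initial_steps` counter above, the first call uses the first-order update and later calls use the second-order multistep update. A usage sketch, with a placeholder noise prediction:

```python
import torch

solver = DPMSolver(num_inference_steps=30)

x = torch.randn(1, 4, 64, 64)
for step in solver.steps:
    predicted_noise = torch.randn_like(x)  # placeholder for the denoiser's output
    x = solver(x, noise=predicted_noise, step=step)
```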
|
|
@ -0,0 +1,95 @@
|
|||
from abc import abstractmethod
|
||||
from torch import Tensor, device as Device, dtype as DType, linspace, float32, sqrt, log
|
||||
from typing import TypeVar
|
||||
|
||||
T = TypeVar("T", bound="Scheduler")
|
||||
|
||||
|
||||
class Scheduler:
|
||||
"""
|
||||
A base class for creating a diffusion model scheduler.
|
||||
|
||||
The Scheduler creates a sequence of noise and scaling factors used in the diffusion process,
|
||||
which gradually transforms the original data distribution into a Gaussian one.
|
||||
|
||||
This process is described using several parameters such as initial and final diffusion rates,
|
||||
and is encapsulated into a `__call__` method that applies a step of the diffusion process.
|
||||
"""
|
||||
|
||||
timesteps: Tensor
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
num_inference_steps: int,
|
||||
num_train_timesteps: int = 1_000,
|
||||
initial_diffusion_rate: float = 8.5e-4,
|
||||
final_diffusion_rate: float = 1.2e-2,
|
||||
device: Device | str = "cpu",
|
||||
dtype: DType = float32,
|
||||
):
|
||||
self.device: Device = Device(device)
|
||||
self.dtype: DType = dtype
|
||||
self.num_inference_steps = num_inference_steps
|
||||
self.num_train_timesteps = num_train_timesteps
|
||||
self.initial_diffusion_rate = initial_diffusion_rate
|
||||
self.final_diffusion_rate = final_diffusion_rate
|
||||
self.scale_factors = (
|
||||
1.0
|
||||
- linspace(
|
||||
start=initial_diffusion_rate**0.5,
|
||||
end=final_diffusion_rate**0.5,
|
||||
steps=num_train_timesteps,
|
||||
dtype=dtype,
|
||||
)
|
||||
** 2
|
||||
)
|
||||
self.cumulative_scale_factors = sqrt(self.scale_factors.cumprod(dim=0))
|
||||
self.noise_std = sqrt(1.0 - self.scale_factors.cumprod(dim=0))
|
||||
self.signal_to_noise_ratios = log(self.cumulative_scale_factors) - log(self.noise_std)
|
||||
self.timesteps = self._generate_timesteps()
|
||||
|
||||
@abstractmethod
|
||||
def __call__(self, x: Tensor, noise: Tensor, step: int) -> Tensor:
|
||||
"""
|
||||
Applies a step of the diffusion process to the input tensor `x` using the provided `noise` and `timestep`.
|
||||
|
||||
This method should be overridden by subclasses to implement the specific diffusion process.
|
||||
"""
|
||||
...
|
||||
|
||||
@abstractmethod
|
||||
def _generate_timesteps(self) -> Tensor:
|
||||
"""
|
||||
Generates a tensor of timesteps.
|
||||
|
||||
This method should be overridden by subclasses to provide the specific timesteps for the diffusion process.
|
||||
"""
|
||||
...
|
||||
|
||||
@property
|
||||
def steps(self) -> list[int]:
|
||||
return list(range(self.num_inference_steps))
|
||||
|
||||
def add_noise(
|
||||
self,
|
||||
x: Tensor,
|
||||
noise: Tensor,
|
||||
step: int,
|
||||
) -> Tensor:
|
||||
timestep = self.timesteps[step]
|
||||
cumulative_scale_factors = self.cumulative_scale_factors[timestep].unsqueeze(-1).unsqueeze(-1)
|
||||
noise_stds = self.noise_std[timestep].unsqueeze(-1).unsqueeze(-1)
|
||||
noised_x = cumulative_scale_factors * x + noise_stds * noise
|
||||
return noised_x
|
||||
|
||||
def to(self: T, device: Device | str | None = None, dtype: DType | None = None) -> T: # type: ignore
|
||||
if device is not None:
|
||||
self.device = Device(device)
|
||||
self.timesteps = self.timesteps.to(device)
|
||||
if dtype is not None:
|
||||
self.dtype = dtype
|
||||
self.scale_factors = self.scale_factors.to(device, dtype=dtype)
|
||||
self.cumulative_scale_factors = self.cumulative_scale_factors.to(device, dtype=dtype)
|
||||
self.noise_std = self.noise_std.to(device, dtype=dtype)
|
||||
self.signal_to_noise_ratios = self.signal_to_noise_ratios.to(device, dtype=dtype)
|
||||
return self
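
To illustrate the contract the two abstract methods define, here is a deliberately naive subclass; it is not a useful sampler, just a sketch of the required pieces, assuming `Scheduler` is in scope:

```python
from torch import Tensor, arange

class NaiveScheduler(Scheduler):
    def _generate_timesteps(self) -> Tensor:
        # Evenly strided, decreasing timesteps over the training schedule.
        step_ratio = self.num_train_timesteps // self.num_inference_steps
        return (arange(self.num_inference_steps) * step_ratio).flip(0)

    def __call__(self, x: Tensor, noise: Tensor, step: int) -> Tensor:
        # Trivial update: subtract the predicted noise scaled by the current noise std.
        timestep = self.timesteps[step]
        return x - self.noise_std[timestep] * noise
```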
|
291
src/refiners/foundationals/latent_diffusion/sdxl_unet.py
Normal file
|
@ -0,0 +1,291 @@
|
|||
from typing import cast
|
||||
from torch import Tensor, device as Device, dtype as DType
|
||||
from refiners.fluxion.context import Contexts
|
||||
import refiners.fluxion.layers as fl
|
||||
from refiners.foundationals.latent_diffusion.cross_attention import CrossAttentionBlock2d
|
||||
from refiners.foundationals.latent_diffusion.unet import ResidualAccumulator, ResidualBlock, ResidualConcatenator
|
||||
from refiners.adapters.range_adapter import RangeAdapter2d, RangeEncoder, compute_sinusoidal_embedding
|
||||
|
||||
|
||||
class TextTimeEmbedding(fl.Chain):
|
||||
structural_attrs = ["timestep_embedding_dim", "time_ids_embedding_dim", "text_time_embedding_dim"]
|
||||
|
||||
def __init__(self, device: Device | str | None = None, dtype: DType | None = None) -> None:
|
||||
self.timestep_embedding_dim = 1280
|
||||
self.time_ids_embedding_dim = 256
|
||||
self.text_time_embedding_dim = 2816
|
||||
super().__init__(
|
||||
fl.Concatenate(
|
||||
fl.UseContext(context="diffusion", key="pooled_text_embedding"),
|
||||
fl.Chain(
|
||||
fl.UseContext(context="diffusion", key="time_ids"),
|
||||
fl.Unsqueeze(dim=-1),
|
||||
fl.Lambda(func=self.compute_sinusoidal_embedding),
|
||||
fl.Reshape(-1),
|
||||
),
|
||||
dim=1,
|
||||
),
|
||||
fl.Linear(
|
||||
in_features=self.text_time_embedding_dim,
|
||||
out_features=self.timestep_embedding_dim,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
fl.SiLU(),
|
||||
fl.Linear(
|
||||
in_features=self.timestep_embedding_dim,
|
||||
out_features=self.timestep_embedding_dim,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
)
|
||||
|
||||
def compute_sinusoidal_embedding(self, x: Tensor) -> Tensor:
|
||||
return compute_sinusoidal_embedding(x=x, embedding_dim=self.time_ids_embedding_dim).to(dtype=self.dtype)
|
||||
|
||||
|
||||
class TimestepEncoder(fl.Passthrough):
|
||||
structural_attrs = ["timestep_embedding_dim"]
|
||||
|
||||
def __init__(self, device: Device | str | None = None, dtype: DType | None = None) -> None:
|
||||
self.timestep_embedding_dim = 1280
|
||||
super().__init__(
|
||||
fl.Sum(
|
||||
fl.Chain(
|
||||
fl.UseContext(context="diffusion", key="timestep"),
|
||||
RangeEncoder(
|
||||
sinuosidal_embedding_dim=320,
|
||||
embedding_dim=self.timestep_embedding_dim,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
),
|
||||
TextTimeEmbedding(device=device, dtype=dtype),
|
||||
),
|
||||
fl.SetContext(context="range_adapter", key="timestep_embedding"),
|
||||
)
|
||||
|
||||
|
||||
class SDXLCrossAttention(CrossAttentionBlock2d):
|
||||
structural_attrs = ["channels", "num_attention_layers", "num_attention_heads"]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
channels: int,
|
||||
num_attention_layers: int = 1,
|
||||
num_attention_heads: int = 10,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
) -> None:
|
||||
super().__init__(
|
||||
channels=channels,
|
||||
context_embedding_dim=2048,
|
||||
context_key="clip_text_embedding",
|
||||
num_attention_layers=num_attention_layers,
|
||||
num_attention_heads=num_attention_heads,
|
||||
use_bias=False,
|
||||
use_linear_projection=True,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
)
|
||||
|
||||
|
||||
class DownBlocks(fl.Chain):
|
||||
structural_attrs = ["in_channels"]
|
||||
|
||||
def __init__(self, in_channels: int, device: Device | str | None = None, dtype: DType | None = None) -> None:
|
||||
self.in_channels = in_channels
|
||||
|
||||
in_block = fl.Chain(
|
||||
fl.Conv2d(in_channels=in_channels, out_channels=320, kernel_size=3, padding=1, device=device, dtype=dtype)
|
||||
)
|
||||
first_blocks = [
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=320, out_channels=320, device=device, dtype=dtype),
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=320, out_channels=320, device=device, dtype=dtype),
|
||||
),
|
||||
fl.Chain(
|
||||
fl.Downsample(channels=320, scale_factor=2, padding=1, device=device, dtype=dtype),
|
||||
),
|
||||
]
|
||||
second_blocks = [
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=320, out_channels=640, device=device, dtype=dtype),
|
||||
SDXLCrossAttention(
|
||||
channels=640, num_attention_layers=2, num_attention_heads=10, device=device, dtype=dtype
|
||||
),
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=640, out_channels=640, device=device, dtype=dtype),
|
||||
SDXLCrossAttention(
|
||||
channels=640, num_attention_layers=2, num_attention_heads=10, device=device, dtype=dtype
|
||||
),
|
||||
),
|
||||
fl.Chain(
|
||||
fl.Downsample(channels=640, scale_factor=2, padding=1, device=device, dtype=dtype),
|
||||
),
|
||||
]
|
||||
third_blocks = [
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=640, out_channels=1280, device=device, dtype=dtype),
|
||||
SDXLCrossAttention(
|
||||
channels=1280, num_attention_layers=10, num_attention_heads=20, device=device, dtype=dtype
|
||||
),
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=1280, out_channels=1280, device=device, dtype=dtype),
|
||||
SDXLCrossAttention(
|
||||
channels=1280, num_attention_layers=10, num_attention_heads=20, device=device, dtype=dtype
|
||||
),
|
||||
),
|
||||
]
|
||||
|
||||
super().__init__(
|
||||
in_block,
|
||||
*first_blocks,
|
||||
*second_blocks,
|
||||
*third_blocks,
|
||||
)
|
||||
|
||||
|
||||
class UpBlocks(fl.Chain):
|
||||
structural_attrs = []
|
||||
|
||||
def __init__(self, device: Device | str | None = None, dtype: DType | None = None) -> None:
|
||||
first_blocks = [
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=2560, out_channels=1280, device=device, dtype=dtype),
|
||||
SDXLCrossAttention(
|
||||
channels=1280, num_attention_layers=10, num_attention_heads=20, device=device, dtype=dtype
|
||||
),
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=2560, out_channels=1280, device=device, dtype=dtype),
|
||||
SDXLCrossAttention(
|
||||
channels=1280, num_attention_layers=10, num_attention_heads=20, device=device, dtype=dtype
|
||||
),
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=1920, out_channels=1280, device=device, dtype=dtype),
|
||||
SDXLCrossAttention(
|
||||
channels=1280, num_attention_layers=10, num_attention_heads=20, device=device, dtype=dtype
|
||||
),
|
||||
fl.Upsample(channels=1280, device=device, dtype=dtype),
|
||||
),
|
||||
]
|
||||
|
||||
second_blocks = [
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=1920, out_channels=640, device=device, dtype=dtype),
|
||||
SDXLCrossAttention(
|
||||
channels=640, num_attention_layers=2, num_attention_heads=10, device=device, dtype=dtype
|
||||
),
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=1280, out_channels=640, device=device, dtype=dtype),
|
||||
SDXLCrossAttention(
|
||||
channels=640, num_attention_layers=2, num_attention_heads=10, device=device, dtype=dtype
|
||||
),
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=960, out_channels=640, device=device, dtype=dtype),
|
||||
SDXLCrossAttention(
|
||||
channels=640, num_attention_layers=2, num_attention_heads=10, device=device, dtype=dtype
|
||||
),
|
||||
fl.Upsample(channels=640, device=device, dtype=dtype),
|
||||
),
|
||||
]
|
||||
|
||||
third_blocks = [
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=960, out_channels=320, device=device, dtype=dtype),
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=640, out_channels=320, device=device, dtype=dtype),
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=640, out_channels=320, device=device, dtype=dtype),
|
||||
),
|
||||
]
|
||||
|
||||
super().__init__(
|
||||
*first_blocks,
|
||||
*second_blocks,
|
||||
*third_blocks,
|
||||
)
|
||||
|
||||
|
||||
class MiddleBlock(fl.Chain):
|
||||
structural_attrs = []
|
||||
|
||||
def __init__(self, device: Device | str | None = None, dtype: DType | None = None) -> None:
|
||||
super().__init__(
|
||||
ResidualBlock(in_channels=1280, out_channels=1280, device=device, dtype=dtype),
|
||||
SDXLCrossAttention(
|
||||
channels=1280, num_attention_layers=10, num_attention_heads=20, device=device, dtype=dtype
|
||||
),
|
||||
ResidualBlock(in_channels=1280, out_channels=1280, device=device, dtype=dtype),
|
||||
)
|
||||
|
||||
|
||||
class OutputBlock(fl.Chain):
|
||||
structural_attrs = []
|
||||
|
||||
def __init__(self, device: Device | str | None = None, dtype: DType | None = None) -> None:
|
||||
super().__init__(
|
||||
fl.GroupNorm(channels=320, num_groups=32, device=device, dtype=dtype),
|
||||
fl.SiLU(),
|
||||
fl.Conv2d(in_channels=320, out_channels=4, kernel_size=3, stride=1, padding=1, device=device, dtype=dtype),
|
||||
)
|
||||
|
||||
|
||||
class SDXLUNet(fl.Chain):
|
||||
structural_attrs = ["in_channels"]
|
||||
|
||||
def __init__(self, in_channels: int, device: Device | str | None = None, dtype: DType | None = None) -> None:
|
||||
self.in_channels = in_channels
|
||||
super().__init__(
|
||||
TimestepEncoder(device=device, dtype=dtype),
|
||||
DownBlocks(in_channels=in_channels, device=device, dtype=dtype),
|
||||
MiddleBlock(device=device, dtype=dtype),
|
||||
fl.Residual(fl.UseContext(context="unet", key="residuals").compose(lambda x: x[-1])),
|
||||
UpBlocks(device=device, dtype=dtype),
|
||||
OutputBlock(device=device, dtype=dtype),
|
||||
)
|
||||
for residual_block in self.layers(ResidualBlock):
|
||||
chain = residual_block.Chain
|
||||
range_adapter = RangeAdapter2d(
|
||||
target=chain.Conv2d_1,
|
||||
channels=residual_block.out_channels,
|
||||
embedding_dim=1280,
|
||||
context_key="timestep_embedding",
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
)
|
||||
range_adapter.inject(chain)
|
||||
for n, block in enumerate(iterable=cast(list[fl.Chain], self.DownBlocks)):
|
||||
block.append(module=ResidualAccumulator(n=n))
|
||||
for n, block in enumerate(iterable=cast(list[fl.Chain], self.UpBlocks)):
|
||||
block.insert(index=0, module=ResidualConcatenator(n=-n - 2))
|
||||
|
||||
def init_context(self) -> Contexts:
|
||||
return {
|
||||
"unet": {"residuals": [0.0] * 10},
|
||||
"diffusion": {"timestep": None, "time_ids": None, "pooled_text_embedding": None},
|
||||
"range_adapter": {"timestep_embedding": None},
|
||||
"sampling": {"shapes": []},
|
||||
}
|
||||
|
||||
def set_clip_text_embedding(self, clip_text_embedding: Tensor) -> None:
|
||||
self.set_context(context="cross_attention_block", value={"clip_text_embedding": clip_text_embedding})
|
||||
|
||||
def set_timestep(self, timestep: Tensor) -> None:
|
||||
self.set_context(context="diffusion", value={"timestep": timestep})
|
||||
|
||||
def set_time_ids(self, time_ids: Tensor) -> None:
|
||||
self.set_context(context="diffusion", value={"time_ids": time_ids})
|
||||
|
||||
def set_pooled_text_embedding(self, pooled_text_embedding: Tensor) -> None:
|
||||
self.set_context(context="diffusion", value={"pooled_text_embedding": pooled_text_embedding})
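
A sketch of how this UNet is driven: the diffusion context is filled in with the setters above before the forward pass. The tensor shapes and the content of `time_ids` are assumptions chosen to match SDXL's usual 1024x1024 setup.

```python
import torch

unet = SDXLUNet(in_channels=4)

unet.set_timestep(torch.tensor([981]))
unet.set_time_ids(torch.tensor([[1024, 1024, 0, 0, 1024, 1024]]))  # original size, crop, target size
unet.set_clip_text_embedding(torch.randn(1, 77, 2048))   # concatenated CLIP-L + OpenCLIP bigG tokens
unet.set_pooled_text_embedding(torch.randn(1, 1280))     # pooled OpenCLIP embedding

latents = torch.randn(1, 4, 128, 128)  # 1024x1024 image -> 128x128 latents
predicted_noise = unet(latents)
```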
|
|
@ -0,0 +1,130 @@
|
|||
from refiners.fluxion.layers import (
|
||||
Passthrough,
|
||||
Lambda,
|
||||
Chain,
|
||||
Concatenate,
|
||||
UseContext,
|
||||
SelfAttention,
|
||||
SetContext,
|
||||
Identity,
|
||||
Parallel,
|
||||
)
|
||||
from refiners.adapters.adapter import Adapter
|
||||
from refiners.foundationals.latent_diffusion.unet import UNet
|
||||
from refiners.foundationals.latent_diffusion.cross_attention import CrossAttentionBlock
|
||||
from torch import Tensor
|
||||
|
||||
|
||||
class SaveLayerNormAdapter(Chain, Adapter[SelfAttention]):
|
||||
def __init__(self, target: SelfAttention, context: str) -> None:
|
||||
self.context = context
|
||||
with self.setup_adapter(target):
|
||||
super().__init__(SetContext(self.context, "norm"), target)
|
||||
|
||||
|
||||
class ReferenceOnlyControlAdapter(Chain, Adapter[SelfAttention]):
|
||||
def __init__(
|
||||
self,
|
||||
target: SelfAttention,
|
||||
context: str,
|
||||
sai: "SelfAttentionInjection",
|
||||
) -> None:
|
||||
self.context = context
|
||||
self._sai = [sai] # only to support setting `style_cfg` dynamically
|
||||
|
||||
sa_guided = target.structural_copy()
|
||||
assert isinstance(sa_guided[0], Parallel)
|
||||
sa_guided.replace(
|
||||
sa_guided[0],
|
||||
Parallel(
|
||||
Identity(),
|
||||
Concatenate(Identity(), UseContext(self.context, "norm"), dim=1),
|
||||
Concatenate(Identity(), UseContext(self.context, "norm"), dim=1),
|
||||
),
|
||||
)
|
||||
|
||||
with self.setup_adapter(target):
|
||||
super().__init__(
|
||||
Parallel(sa_guided, Chain(Lambda(lambda x: x[:1]), target)),
|
||||
Lambda(self.compute_averaged_unconditioned_x),
|
||||
)
|
||||
|
||||
def compute_averaged_unconditioned_x(self, x: Tensor, unguided_unconditioned_x: Tensor) -> Tensor:
|
||||
style_cfg = self._sai[0].style_cfg
|
||||
x[0] = style_cfg * x[0] + (1.0 - style_cfg) * unguided_unconditioned_x
|
||||
return x
|
||||
|
||||
|
||||
class SelfAttentionInjection(Passthrough):
|
||||
# TODO: Does not support batching yet. Assumes concatenated inputs for classifier-free guidance
|
||||
|
||||
def __init__(self, unet: UNet, style_cfg: float = 0.5) -> None:
|
||||
# style_cfg is the weight given to the guide in the unconditioned branch of the diffusion.
|
||||
# A value of 0.5 is the recommendation from the sd-webui reference-only implementation.
|
||||
self.style_cfg = style_cfg
|
||||
self._adapters: list[ReferenceOnlyControlAdapter] = []
|
||||
self._unet = [unet]
|
||||
|
||||
guide_unet = unet.structural_copy()
|
||||
for i, attention_block in enumerate(guide_unet.layers(CrossAttentionBlock)):
|
||||
sa = attention_block.find(SelfAttention)
|
||||
assert sa is not None and sa.parent is not None
|
||||
SaveLayerNormAdapter(sa, context=f"self_attention_context_{i}").inject()
|
||||
|
||||
for i, attention_block in enumerate(unet.layers(CrossAttentionBlock)):
|
||||
unet.set_context(f"self_attention_context_{i}", {"norm": None})
|
||||
|
||||
sa = attention_block.find(SelfAttention)
|
||||
assert sa is not None and sa.parent is not None
|
||||
|
||||
self._adapters.append(ReferenceOnlyControlAdapter(sa, context=f"self_attention_context_{i}", sai=self))
|
||||
|
||||
super().__init__(
|
||||
Lambda(self.copy_diffusion_context),
|
||||
UseContext("self_attention_injection", "guide"),
|
||||
guide_unet,
|
||||
Lambda(self.restore_diffusion_context),
|
||||
)
|
||||
|
||||
@property
|
||||
def unet(self):
|
||||
return self._unet[0]
|
||||
|
||||
def inject(self) -> None:
|
||||
assert self not in self._unet[0], f"{self} is already injected"
|
||||
for adapter in self._adapters:
|
||||
adapter.inject()
|
||||
self.unet.insert(0, self)
|
||||
|
||||
def eject(self) -> None:
|
||||
assert self.unet[0] == self, f"{self} is not the first element of target UNet"
|
||||
for adapter in self._adapters:
|
||||
adapter.eject()
|
||||
self.unet.pop(0)
|
||||
|
||||
def set_controlnet_condition(self, condition: Tensor) -> None:
|
||||
self.set_context("self_attention_injection", {"guide": condition})
|
||||
|
||||
def copy_diffusion_context(self, x: Tensor) -> Tensor:
|
||||
# This avoids disrupting the accumulation of residuals in the UNet (e.g. when ControlNet adapters are used).
|
||||
self.set_context(
|
||||
"self_attention_residuals_buffer",
|
||||
{"buffer": self.use_context("unet")["residuals"]},
|
||||
)
|
||||
self.set_context(
|
||||
"unet",
|
||||
{"residuals": [0.0] * 13},
|
||||
)
|
||||
return x
|
||||
|
||||
def restore_diffusion_context(self, x: Tensor) -> Tensor:
|
||||
self.set_context(
|
||||
"unet",
|
||||
{
|
||||
"residuals": self.use_context("self_attention_residuals_buffer")["buffer"],
|
||||
},
|
||||
)
|
||||
return x
|
||||
|
||||
def structural_copy(self: "SelfAttentionInjection") -> "SelfAttentionInjection":
|
||||
raise RuntimeError("SelfAttentionInjection cannot be copied, eject it first.")
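
The intended workflow is to wrap an existing UNet, inject the adapter, and provide the reference latents as the guide before running the usual denoising loop. How the reference image is encoded into latents is an assumption here.

```python
import torch

sai = SelfAttentionInjection(unet, style_cfg=0.5)  # `unet` is an already-built UNet
sai.inject()  # prepends itself to the UNet and patches its self-attention layers

reference_latents = torch.randn(1, 4, 64, 64)  # assumed to come from the LDA-encoded reference image
sai.set_controlnet_condition(reference_latents)

# ... run the usual denoising loop on the patched UNet ...

sai.eject()  # restores the original UNet
```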
|
307
src/refiners/foundationals/latent_diffusion/unet.py
Normal file
|
@ -0,0 +1,307 @@
|
|||
from typing import cast, Iterable
|
||||
|
||||
from torch import Tensor, device as Device, dtype as DType
|
||||
|
||||
from refiners.fluxion.context import Contexts
|
||||
import refiners.fluxion.layers as fl
|
||||
|
||||
from refiners.foundationals.latent_diffusion.cross_attention import CrossAttentionBlock2d
|
||||
from refiners.adapters.range_adapter import RangeEncoder, RangeAdapter2d
|
||||
|
||||
|
||||
class TimestepEncoder(fl.Passthrough):
|
||||
def __init__(
|
||||
self,
|
||||
context_key: str = "timestep_embedding",
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
) -> None:
|
||||
super().__init__(
|
||||
fl.UseContext("diffusion", "timestep"),
|
||||
RangeEncoder(320, 1280, device=device, dtype=dtype),
|
||||
fl.SetContext("range_adapter", context_key),
|
||||
)
|
||||
|
||||
|
||||
class ResidualBlock(fl.Sum):
|
||||
structural_attrs = ["in_channels", "out_channels", "num_groups", "eps"]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
in_channels: int,
|
||||
out_channels: int,
|
||||
num_groups: int = 32,
|
||||
eps: float = 1e-5,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
) -> None:
|
||||
if in_channels % num_groups != 0 or out_channels % num_groups != 0:
|
||||
raise ValueError("Number of input and output channels must be divisible by num_groups.")
|
||||
self.in_channels = in_channels
|
||||
self.out_channels = out_channels
|
||||
self.num_groups = num_groups
|
||||
self.eps = eps
|
||||
shortcut = (
|
||||
fl.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=1, device=device, dtype=dtype)
|
||||
if in_channels != out_channels
|
||||
else fl.Identity()
|
||||
)
|
||||
super().__init__(
|
||||
fl.Chain(
|
||||
fl.GroupNorm(channels=in_channels, num_groups=num_groups, eps=eps, device=device, dtype=dtype),
|
||||
fl.SiLU(),
|
||||
fl.Conv2d(
|
||||
in_channels=in_channels,
|
||||
out_channels=out_channels,
|
||||
kernel_size=3,
|
||||
padding=1,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
fl.GroupNorm(channels=out_channels, num_groups=num_groups, eps=eps, device=device, dtype=dtype),
|
||||
fl.SiLU(),
|
||||
fl.Conv2d(
|
||||
in_channels=out_channels,
|
||||
out_channels=out_channels,
|
||||
kernel_size=3,
|
||||
padding=1,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
),
|
||||
shortcut,
|
||||
)
|
||||
|
||||
|
||||
class CLIPLCrossAttention(CrossAttentionBlock2d):
|
||||
def __init__(
|
||||
self,
|
||||
channels: int,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
) -> None:
|
||||
super().__init__(
|
||||
channels=channels,
|
||||
context_embedding_dim=768,
|
||||
context_key="clip_text_embedding",
|
||||
num_attention_heads=8,
|
||||
use_bias=False,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
)
|
||||
|
||||
|
||||
class DownBlocks(fl.Chain):
|
||||
structural_attrs = ["in_channels"]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
in_channels: int,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
):
|
||||
self.in_channels = in_channels
|
||||
super().__init__(
|
||||
fl.Chain(
|
||||
fl.Conv2d(
|
||||
in_channels=in_channels, out_channels=320, kernel_size=3, padding=1, device=device, dtype=dtype
|
||||
)
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=320, out_channels=320, device=device, dtype=dtype),
|
||||
CLIPLCrossAttention(channels=320, device=device, dtype=dtype),
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=320, out_channels=320, device=device, dtype=dtype),
|
||||
CLIPLCrossAttention(channels=320, device=device, dtype=dtype),
|
||||
),
|
||||
fl.Chain(fl.Downsample(channels=320, scale_factor=2, padding=1, device=device, dtype=dtype)),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=320, out_channels=640, device=device, dtype=dtype),
|
||||
CLIPLCrossAttention(channels=640, device=device, dtype=dtype),
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=640, out_channels=640, device=device, dtype=dtype),
|
||||
CLIPLCrossAttention(channels=640, device=device, dtype=dtype),
|
||||
),
|
||||
fl.Chain(fl.Downsample(channels=640, scale_factor=2, padding=1, device=device, dtype=dtype)),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=640, out_channels=1280, device=device, dtype=dtype),
|
||||
CLIPLCrossAttention(channels=1280, device=device, dtype=dtype),
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=1280, out_channels=1280, device=device, dtype=dtype),
|
||||
CLIPLCrossAttention(channels=1280, device=device, dtype=dtype),
|
||||
),
|
||||
fl.Chain(fl.Downsample(channels=1280, scale_factor=2, padding=1, device=device, dtype=dtype)),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=1280, out_channels=1280, device=device, dtype=dtype),
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=1280, out_channels=1280, device=device, dtype=dtype),
|
||||
),
|
||||
)
|
||||
|
||||
|
||||
class UpBlocks(fl.Chain):
|
||||
def __init__(
|
||||
self,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
) -> None:
|
||||
super().__init__(
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=2560, out_channels=1280, device=device, dtype=dtype),
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=2560, out_channels=1280, device=device, dtype=dtype),
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=2560, out_channels=1280, device=device, dtype=dtype),
|
||||
fl.Upsample(channels=1280, device=device, dtype=dtype),
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=2560, out_channels=1280, device=device, dtype=dtype),
|
||||
CLIPLCrossAttention(channels=1280, device=device, dtype=dtype),
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=2560, out_channels=1280, device=device, dtype=dtype),
|
||||
CLIPLCrossAttention(channels=1280, device=device, dtype=dtype),
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=1920, out_channels=1280, device=device, dtype=dtype),
|
||||
CLIPLCrossAttention(channels=1280, device=device, dtype=dtype),
|
||||
fl.Upsample(channels=1280, device=device, dtype=dtype),
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=1920, out_channels=640, device=device, dtype=dtype),
|
||||
CLIPLCrossAttention(channels=640, device=device, dtype=dtype),
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=1280, out_channels=640, device=device, dtype=dtype),
|
||||
CLIPLCrossAttention(channels=640, device=device, dtype=dtype),
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=960, out_channels=640, device=device, dtype=dtype),
|
||||
CLIPLCrossAttention(channels=640, device=device, dtype=dtype),
|
||||
fl.Upsample(channels=640, device=device, dtype=dtype),
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=960, out_channels=320, device=device, dtype=dtype),
|
||||
CLIPLCrossAttention(channels=320, device=device, dtype=dtype),
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=640, out_channels=320, device=device, dtype=dtype),
|
||||
CLIPLCrossAttention(channels=320, device=device, dtype=dtype),
|
||||
),
|
||||
fl.Chain(
|
||||
ResidualBlock(in_channels=640, out_channels=320, device=device, dtype=dtype),
|
||||
CLIPLCrossAttention(channels=320, device=device, dtype=dtype),
|
||||
),
|
||||
)
|
||||
|
||||
|
||||
class MiddleBlock(fl.Chain):
|
||||
def __init__(self, device: Device | str | None = None, dtype: DType | None = None) -> None:
|
||||
super().__init__(
|
||||
ResidualBlock(in_channels=1280, out_channels=1280, device=device, dtype=dtype),
|
||||
CLIPLCrossAttention(channels=1280, device=device, dtype=dtype),
|
||||
ResidualBlock(in_channels=1280, out_channels=1280, device=device, dtype=dtype),
|
||||
)
|
||||
|
||||
|
||||
class ResidualAccumulator(fl.Passthrough):
|
||||
structural_attrs = ["n"]
|
||||
|
||||
def __init__(self, n: int) -> None:
|
||||
self.n = n
|
||||
|
||||
super().__init__(
|
||||
fl.Residual(
|
||||
fl.UseContext(context="unet", key="residuals").compose(func=lambda residuals: residuals[self.n])
|
||||
),
|
||||
fl.SetContext(context="unet", key="residuals", callback=self.update),
|
||||
)
|
||||
|
||||
def update(self, residuals: list[Tensor | float], x: Tensor) -> None:
|
||||
residuals[self.n] = x
|
||||
|
||||
|
||||
class ResidualConcatenator(fl.Chain):
|
||||
structural_attrs = ["n"]
|
||||
|
||||
def __init__(self, n: int) -> None:
|
||||
self.n = n
|
||||
|
||||
super().__init__(
|
||||
fl.Concatenate(
|
||||
fl.Identity(),
|
||||
fl.UseContext(context="unet", key="residuals").compose(lambda residuals: residuals[self.n]),
|
||||
dim=1,
|
||||
),
|
||||
)
|
||||
|
||||
|
||||
class UNet(fl.Chain):
|
||||
structural_attrs = ["in_channels", "clip_embedding_dim"]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
in_channels: int,
|
||||
clip_embedding_dim: int,
|
||||
device: Device | str | None = None,
|
||||
dtype: DType | None = None,
|
||||
):
|
||||
self.in_channels = in_channels
|
||||
self.clip_embedding_dim = clip_embedding_dim
|
||||
super().__init__(
|
||||
TimestepEncoder(device=device, dtype=dtype),
|
||||
DownBlocks(in_channels=in_channels, device=device, dtype=dtype),
|
||||
fl.Sum(
|
||||
fl.UseContext(context="unet", key="residuals").compose(lambda x: x[-1]),
|
||||
MiddleBlock(device=device, dtype=dtype),
|
||||
),
|
||||
UpBlocks(device=device, dtype=dtype),
|
||||
fl.Chain(
|
||||
fl.GroupNorm(channels=320, num_groups=32, device=device, dtype=dtype),
|
||||
fl.SiLU(),
|
||||
fl.Conv2d(
|
||||
in_channels=320,
|
||||
out_channels=4,
|
||||
kernel_size=3,
|
||||
stride=1,
|
||||
padding=1,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
),
|
||||
),
|
||||
)
|
||||
for residual_block in self.layers(ResidualBlock):
|
||||
chain = residual_block.Chain
|
||||
range_adapter = RangeAdapter2d(
|
||||
target=chain.Conv2d_1,
|
||||
channels=residual_block.out_channels,
|
||||
embedding_dim=1280,
|
||||
context_key="timestep_embedding",
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
)
|
||||
range_adapter.inject(chain)
|
||||
for n, block in enumerate(cast(Iterable[fl.Chain], self.DownBlocks)):
|
||||
block.append(ResidualAccumulator(n))
|
||||
for n, block in enumerate(cast(Iterable[fl.Chain], self.UpBlocks)):
|
||||
block.insert(0, ResidualConcatenator(-n - 2))
|
||||
|
||||
def init_context(self) -> Contexts:
|
||||
return {
|
||||
"unet": {"residuals": [0.0] * 13},
|
||||
"diffusion": {"timestep": None},
|
||||
"range_adapter": {"timestep_embedding": None},
|
||||
"sampling": {"shapes": []},
|
||||
}
|
||||
|
||||
def set_clip_text_embedding(self, clip_text_embedding: Tensor) -> None:
|
||||
self.set_context("cross_attention_block", {"clip_text_embedding": clip_text_embedding})
|
||||
|
||||
def set_timestep(self, timestep: Tensor) -> None:
|
||||
self.set_context("diffusion", {"timestep": timestep})
|
0
src/refiners/py.typed
Normal file
17
src/refiners/training_utils/__init__.py
Normal file
|
@ -0,0 +1,17 @@
|
|||
from importlib import import_module
|
||||
from importlib.metadata import requires
|
||||
import sys
|
||||
|
||||
refiners_requires = requires("refiners")
|
||||
assert refiners_requires is not None
|
||||
|
||||
for dep in filter(lambda r: r.endswith('extra == "training"'), refiners_requires):
|
||||
try:
|
||||
import_module(dep.split(" ")[0])
|
||||
except ImportError:
|
||||
print(
|
||||
"Some dependencies are missing. Please install refiners with the `training` extra, e.g. `pip install"
|
||||
" refiners[training]`",
|
||||
file=sys.stderr,
|
||||
)
|
||||
sys.exit(1)
|
186
src/refiners/training_utils/callback.py
Normal file
|
@ -0,0 +1,186 @@
|
|||
from typing import TYPE_CHECKING, Generic, Iterable, Any, TypeVar
|
||||
from torch import tensor
|
||||
from torch.nn import Parameter
|
||||
from loguru import logger
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from refiners.training_utils.config import BaseConfig
|
||||
from refiners.training_utils.trainer import Trainer
|
||||
|
||||
__all__ = [
|
||||
"Callback",
|
||||
"GradientNormClipping",
|
||||
"GradientValueClipping",
|
||||
"ClockCallback",
|
||||
"GradientNormLogging",
|
||||
"MonitorLoss",
|
||||
]
|
||||
|
||||
|
||||
def clip_gradient_norm(parameters: Iterable[Parameter], total_norm: float, clip_norm: float = 1.0) -> None:
|
||||
"""
|
||||
Clips the gradient norm of a model's parameters, similarly to `torch.nn.utils.clip_grad_norm_`.
|
||||
"""
|
||||
gradients = [p.grad.detach() for p in parameters if p.grad is not None]
|
||||
assert gradients, "The model has no gradients to clip."
|
||||
clip_coefficient = tensor(data=clip_norm / (total_norm + 1e-6)).clamp(max=1)
|
||||
for gradient in gradients:
|
||||
gradient.mul_(other=clip_coefficient) # type: ignore
|
||||
|
||||
|
||||
def clip_gradient_value(parameters: Iterable[Parameter], clip_value: float) -> None:
|
||||
"""
|
||||
Clips each gradient of a model's parameters individually, similarly to `torch.nn.utils.clip_grad_value_`.
|
||||
"""
|
||||
gradients = [p.grad.detach() for p in parameters if p.grad is not None]
|
||||
assert gradients, "The model has no gradients to clip."
|
||||
for gradient in gradients:
|
||||
gradient.clamp_(min=-clip_value, max=clip_value)
|
||||
|
||||
|
||||
T = TypeVar("T")
|
||||
|
||||
|
||||
class Callback(Generic[T]):
|
||||
def on_train_begin(self, trainer: T) -> None:
|
||||
...
|
||||
|
||||
def on_train_end(self, trainer: T) -> None:
|
||||
...
|
||||
|
||||
def on_epoch_begin(self, trainer: T) -> None:
|
||||
...
|
||||
|
||||
def on_epoch_end(self, trainer: T) -> None:
|
||||
...
|
||||
|
||||
def on_batch_begin(self, trainer: T) -> None:
|
||||
...
|
||||
|
||||
def on_batch_end(self, trainer: T) -> None:
|
||||
...
|
||||
|
||||
def on_backward_begin(self, trainer: T) -> None:
|
||||
...
|
||||
|
||||
def on_backward_end(self, trainer: T) -> None:
|
||||
...
|
||||
|
||||
def on_optimizer_step_begin(self, trainer: T) -> None:
|
||||
...
|
||||
|
||||
def on_optimizer_step_end(self, trainer: T) -> None:
|
||||
...
|
||||
|
||||
def on_compute_loss_begin(self, trainer: T) -> None:
|
||||
...
|
||||
|
||||
def on_compute_loss_end(self, trainer: T) -> None:
|
||||
...
|
||||
|
||||
def on_evaluate_begin(self, trainer: T) -> None:
|
||||
...
|
||||
|
||||
def on_evaluate_end(self, trainer: T) -> None:
|
||||
...
|
||||
|
||||
def on_lr_scheduler_step_begin(self, trainer: T) -> None:
|
||||
...
|
||||
|
||||
def on_lr_scheduler_step_end(self, trainer: T) -> None:
|
||||
...
|
||||
|
||||
def on_checkpoint_save(self, trainer: T) -> None:
|
||||
...
|
||||
|
||||
|
||||
class ClockCallback(Callback["Trainer[BaseConfig, Any]"]):
|
||||
def on_train_begin(self, trainer: "Trainer[BaseConfig, Any]") -> None:
|
||||
trainer.clock.reset()
|
||||
logger.info(f"""Starting training for a total of:
|
||||
{trainer.clock.num_steps} steps.
|
||||
{trainer.clock.num_epochs} epochs.
|
||||
{trainer.clock.num_iterations} iterations.
|
||||
""")
|
||||
trainer.clock.start_timer()
|
||||
|
||||
def on_train_end(self, trainer: "Trainer[BaseConfig, Any]") -> None:
|
||||
trainer.clock.stop_timer()
|
||||
logger.info(f"""Training took:
|
||||
{trainer.clock.time_elapsed} seconds.
|
||||
{trainer.clock.iteration} iterations.
|
||||
{trainer.clock.epoch} epochs.
|
||||
{trainer.clock.step} steps.
|
||||
""")
|
||||
|
||||
def on_epoch_begin(self, trainer: "Trainer[BaseConfig, Any]") -> None:
|
||||
logger.info(f"Epoch {trainer.clock.epoch} started.")
|
||||
|
||||
def on_epoch_end(self, trainer: "Trainer[BaseConfig, Any]") -> None:
|
||||
trainer.clock.epoch += 1
|
||||
trainer.clock.num_batches_processed = 0
|
||||
|
||||
def on_batch_begin(self, trainer: "Trainer[BaseConfig, Any]") -> None:
|
||||
logger.info(f"Step {trainer.clock.step} started.")
|
||||
|
||||
def on_backward_end(self, trainer: "Trainer[BaseConfig, Any]") -> None:
|
||||
trainer.clock.step += 1
|
||||
trainer.clock.num_batches_processed += 1
|
||||
trainer.clock.num_minibatches_processed += 1
|
||||
|
||||
def on_optimizer_step_end(self, trainer: "Trainer[BaseConfig, Any]") -> None:
|
||||
logger.info(f"Iteration {trainer.clock.iteration} ended.")
|
||||
trainer.clock.iteration += 1
|
||||
trainer.clock.num_minibatches_processed = 0
|
||||
|
||||
def on_evaluate_begin(self, trainer: "Trainer[BaseConfig, Any]") -> None:
|
||||
logger.info("Evaluation started.")
|
||||
|
||||
def on_evaluate_end(self, trainer: "Trainer[BaseConfig, Any]") -> None:
|
||||
logger.info("Evaluation ended.")
|
||||
|
||||
|
||||
class MonitorLoss(Callback["Trainer[BaseConfig, Any]"]):
|
||||
def on_train_begin(self, trainer: "Trainer[BaseConfig, Any]") -> None:
|
||||
self.epoch_losses: list[float] = []
|
||||
self.iteration_losses: list[float] = []
|
||||
|
||||
def on_compute_loss_end(self, trainer: "Trainer[BaseConfig, Any]") -> None:
|
||||
loss_value = trainer.loss.detach().cpu().item()
|
||||
self.epoch_losses.append(loss_value)
|
||||
self.iteration_losses.append(loss_value)
|
||||
trainer.log(data={"step_loss": loss_value})
|
||||
|
||||
def on_optimizer_step_end(self, trainer: "Trainer[BaseConfig, Any]") -> None:
|
||||
avg_iteration_loss = sum(self.iteration_losses) / len(self.iteration_losses)
|
||||
trainer.log(data={"average_iteration_loss": avg_iteration_loss})
|
||||
self.iteration_losses = []
|
||||
|
||||
def on_epoch_end(self, trainer: "Trainer[BaseConfig, Any]") -> None:
|
||||
avg_epoch_loss = sum(self.epoch_losses) / len(self.epoch_losses)
|
||||
trainer.log(data={"average_epoch_loss": avg_epoch_loss, "epoch": trainer.clock.epoch})
|
||||
self.epoch_losses = []
|
||||
|
||||
def on_lr_scheduler_step_end(self, trainer: "Trainer[BaseConfig, Any]") -> None:
|
||||
trainer.log(data={"learning_rate": trainer.optimizer.param_groups[0]["lr"]})
|
||||
|
||||
|
||||
class GradientNormClipping(Callback["Trainer[BaseConfig, Any]"]):
|
||||
def on_backward_end(self, trainer: "Trainer[BaseConfig, Any]") -> None:
|
||||
clip_norm = trainer.config.training.clip_grad_norm
|
||||
if clip_norm is not None:
|
||||
clip_gradient_norm(
|
||||
parameters=trainer.learnable_parameters, total_norm=trainer.total_gradient_norm, clip_norm=clip_norm
|
||||
)
|
||||
|
||||
|
||||
class GradientValueClipping(Callback["Trainer[BaseConfig, Any]"]):
|
||||
def on_backward_end(self, trainer: "Trainer[BaseConfig, Any]") -> None:
|
||||
clip_value = trainer.config.training.clip_grad_value
|
||||
if clip_value is not None:
|
||||
clip_gradient_value(parameters=trainer.learnable_parameters, clip_value=clip_value)
|
||||
|
||||
|
||||
class GradientNormLogging(Callback["Trainer[BaseConfig, Any]"]):
|
||||
def on_backward_end(self, trainer: "Trainer[BaseConfig, Any]") -> None:
|
||||
trainer.log(data={"total_grad_norm": trainer.total_gradient_norm})
|
242
src/refiners/training_utils/config.py
Normal file
|
@ -0,0 +1,242 @@
|
|||
from logging import warn
|
||||
from pathlib import Path
|
||||
from typing import Any, Callable, Iterable, Literal, Type, TypeVar
|
||||
from typing_extensions import TypedDict # https://errors.pydantic.dev/2.0b3/u/typed-dict-version
|
||||
from torch.optim import AdamW, SGD, Optimizer, Adam
|
||||
from torch.nn import Parameter
|
||||
from enum import Enum
|
||||
from bitsandbytes.optim import AdamW8bit, Lion8bit # type: ignore
|
||||
from pydantic import BaseModel, validator
|
||||
import tomli
|
||||
import refiners.fluxion.layers as fl
|
||||
from prodigyopt import Prodigy # type: ignore
|
||||
from refiners.training_utils.dropout import apply_dropout, apply_gyro_dropout
|
||||
|
||||
|
||||
__all__ = [
|
||||
"parse_number_unit_field",
|
||||
"TimeUnit",
|
||||
"TimeValue",
|
||||
"TrainingConfig",
|
||||
"OptimizerConfig",
|
||||
"Optimizers",
|
||||
]
|
||||
|
||||
|
||||
class TimeUnit(Enum):
|
||||
STEP = "step"
|
||||
EPOCH = "epoch"
|
||||
ITERATION = "iteration"
|
||||
DEFAULT = "step"
|
||||
|
||||
|
||||
class TimeValue(TypedDict):
|
||||
number: int
|
||||
unit: TimeUnit
|
||||
|
||||
|
||||
def parse_number_unit_field(value: str | int | dict[str, str | int]) -> TimeValue:
|
||||
match value:
|
||||
case str(value_str):
|
||||
number, unit = value_str.split(sep=":")
|
||||
return {"number": int(number.strip()), "unit": TimeUnit(value=unit.strip().lower())}
|
||||
case int(number):
|
||||
return {"number": number, "unit": TimeUnit.DEFAULT}
|
||||
case {"number": int(number), "unit": str(unit)}:
|
||||
return {"number": number, "unit": TimeUnit(value=unit.lower())}
|
||||
case _:
|
||||
raise ValueError(f"Unsupported value format: {value}")
|
||||
|
||||
|
||||
class TrainingConfig(BaseModel):
|
||||
duration: TimeValue = {"number": 1, "unit": TimeUnit.ITERATION}
|
||||
seed: int = 0
|
||||
gpu_index: int = 0
|
||||
batch_size: int = 1
|
||||
gradient_accumulation: TimeValue = {"number": 1, "unit": TimeUnit.STEP}
|
||||
clip_grad_norm: float | None = None
|
||||
clip_grad_value: float | None = None
|
||||
evaluation_interval: TimeValue = {"number": 1, "unit": TimeUnit.ITERATION}
|
||||
evaluation_seed: int = 0
|
||||
|
||||
@validator("duration", "gradient_accumulation", "evaluation_interval", pre=True)
|
||||
def parse_field(cls, value: Any) -> TimeValue:
|
||||
return parse_number_unit_field(value)
|
||||
|
||||
|
||||
class Optimizers(str, Enum):
|
||||
SGD = "SGD"
|
||||
Adam = "Adam"
|
||||
AdamW = "AdamW"
|
||||
AdamW8bit = "AdamW8bit"
|
||||
Lion8bit = "Lion8bit"
|
||||
Prodigy = "Prodigy"
|
||||
|
||||
|
||||
class SchedulerType(str, Enum):
|
||||
STEP_LR = "StepLR"
|
||||
EXPONENTIAL_LR = "ExponentialLR"
|
||||
REDUCE_LR_ON_PLATEAU = "ReduceLROnPlateau"
|
||||
COSINE_ANNEALING_LR = "CosineAnnealingLR"
|
||||
CONSTANT_LR = "ConstantLR" # not to be confused with PyTorch's ConstantLR
|
||||
LAMBDA_LR = "LambdaLR"
|
||||
ONE_CYCLE_LR = "OneCycleLR"
|
||||
MULTIPLICATIVE_LR = "MultiplicativeLR"
|
||||
COSINE_ANNEALING_WARM_RESTARTS = "CosineAnnealingWarmRestarts"
|
||||
CYCLIC_LR = "CyclicLR"
|
||||
MULTI_STEP_LR = "MultiStepLR"
|
||||
DEFAULT = "ConstantLR"
|
||||
|
||||
|
||||
class SchedulerConfig(BaseModel):
|
||||
scheduler_type: SchedulerType = SchedulerType.DEFAULT
|
||||
update_interval: TimeValue = {"number": 1, "unit": TimeUnit.ITERATION}
|
||||
warmup: TimeValue = {"number": 0, "unit": TimeUnit.ITERATION}
|
||||
gamma: float = 0.1
|
||||
lr_lambda: Callable[[int], float] | None = None
|
||||
mode: Literal["min", "max"] = "min"
|
||||
factor: float = 0.1
|
||||
patience: int = 10
|
||||
threshold: float = 1e-4
|
||||
cooldown: int = 0
|
||||
milestones: list[int] = []
|
||||
base_lr: float = 1e-7
|
||||
min_lr: float | list[float] = 0
|
||||
max_lr: float | list[float] = 0
|
||||
eta_min: float = 0
|
||||
|
||||
@validator("update_interval", "warmup", pre=True)
|
||||
def parse_field(cls, value: Any) -> TimeValue:
|
||||
return parse_number_unit_field(value)
|
||||
|
||||
|
||||
class OptimizerConfig(BaseModel):
|
||||
optimizer: Optimizers
|
||||
learning_rate: float = 1e-4
|
||||
betas: tuple[float, float] = (0.9, 0.999)
|
||||
eps: float = 1e-8
|
||||
weight_decay: float = 0.0
|
||||
|
||||
def get(self, model_parameters: Iterable[Parameter]) -> Optimizer:
|
||||
match self.optimizer:
|
||||
case Optimizers.SGD:
|
||||
return SGD(
|
||||
params=model_parameters,
|
||||
lr=self.learning_rate,
|
||||
weight_decay=self.weight_decay,
|
||||
)
|
||||
case Optimizers.Adam:
|
||||
return Adam(
|
||||
params=model_parameters,
|
||||
lr=self.learning_rate,
|
||||
betas=self.betas,
|
||||
eps=self.eps,
|
||||
weight_decay=self.weight_decay,
|
||||
)
|
||||
case Optimizers.AdamW:
|
||||
return AdamW(
|
||||
params=model_parameters,
|
||||
lr=self.learning_rate,
|
||||
betas=self.betas,
|
||||
eps=self.eps,
|
||||
weight_decay=self.weight_decay,
|
||||
)
|
||||
case Optimizers.AdamW8bit:
|
||||
return AdamW8bit(
|
||||
params=model_parameters,
|
||||
lr=self.learning_rate,
|
||||
betas=self.betas,
|
||||
eps=self.eps,
|
||||
weight_decay=self.weight_decay,
|
||||
)
|
||||
case Optimizers.Lion8bit:
|
||||
return Lion8bit(
|
||||
params=model_parameters,
|
||||
lr=self.learning_rate,
|
||||
betas=self.betas,
|
||||
weight_decay=self.weight_decay, # type: ignore
|
||||
)
|
||||
case Optimizers.Prodigy:
|
||||
if self.learning_rate != 1.0:
|
||||
warn("Prodigy learning rate is not 1.0, this might cause instability.")
|
||||
return Prodigy(
|
||||
lr=self.learning_rate,
|
||||
params=model_parameters,
|
||||
betas=self.betas,
|
||||
weight_decay=self.weight_decay, # type: ignore
|
||||
safeguard_warmup=True,
|
||||
)
|
||||
|
||||
|
||||
class ModelConfig(BaseModel):
|
||||
checkpoint: Path | None = None
|
||||
train: bool = True
|
||||
learning_rate: float | None = None # TODO: Implement this
|
||||
|
||||
|
||||
class GyroDropoutConfig(BaseModel):
|
||||
total_subnetworks: int = 512
|
||||
concurent_subnetworks: int = 64
|
||||
iters_per_epoch: int = 512
|
||||
num_features_threshold: float = 5e5
|
||||
|
||||
|
||||
class DropoutConfig(BaseModel):
|
||||
dropout_probability: float = 0.0
|
||||
gyro_dropout: GyroDropoutConfig | None = None
|
||||
|
||||
def apply_dropout(self, model: fl.Chain) -> None:
|
||||
if self.dropout_probability > 0.0:
|
||||
if self.gyro_dropout is not None:
|
||||
apply_gyro_dropout(module=model, probability=self.dropout_probability, **self.gyro_dropout.model_dump())
|
||||
else:
|
||||
apply_dropout(module=model, probability=self.dropout_probability)
|
||||
|
||||
|
||||
class WandbConfig(BaseModel):
|
||||
mode: Literal["online", "offline", "disabled"] = "online"
|
||||
project: str
|
||||
entity: str = "finegrain"
|
||||
name: str | None = None
|
||||
tags: list[str] = []
|
||||
group: str | None = None
|
||||
job_type: str | None = None
|
||||
notes: str | None = None
|
||||
|
||||
|
||||
class HuggingfaceDatasetConfig(BaseModel):
|
||||
hf_repo: str = "finegrain/unsplash-dummy"
|
||||
revision: str = "main"
|
||||
split: str = "train"
|
||||
use_verification: bool = False
|
||||
|
||||
|
||||
class CheckpointingConfig(BaseModel):
|
||||
save_folder: Path | None = None
|
||||
save_interval: TimeValue = {"number": 1, "unit": TimeUnit.EPOCH}
|
||||
|
||||
@validator("save_interval", pre=True)
|
||||
def parse_field(cls, value: Any) -> TimeValue:
|
||||
return parse_number_unit_field(value)
|
||||
|
||||
|
||||
T = TypeVar("T", bound="BaseConfig")
|
||||
|
||||
|
||||
class BaseConfig(BaseModel):
|
||||
script: Path # TODO not used for now, but will be used by the cli
|
||||
models: dict[str, ModelConfig]
|
||||
wandb: WandbConfig
|
||||
training: TrainingConfig
|
||||
optimizer: OptimizerConfig
|
||||
scheduler: SchedulerConfig
|
||||
dropout: DropoutConfig
|
||||
dataset: HuggingfaceDatasetConfig
|
||||
checkpointing: CheckpointingConfig
|
||||
|
||||
@classmethod
|
||||
def load_from_toml(cls: Type[T], toml_path: Path | str) -> T:
|
||||
with open(file=toml_path, mode="rb") as f:
|
||||
config_dict = tomli.load(f)
|
||||
|
||||
return cls(**config_dict)
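
A sketch of a configuration matching this schema; every value is a placeholder, and only the sections `BaseConfig` requires are present (in practice the same content would live in a `.toml` file passed to `load_from_toml`).

```python
import tomli

toml_text = """
script = "train.py"

[models.unet]
train = true

[wandb]
mode = "offline"
project = "my-project"

[training]
duration = "1000:step"
batch_size = 4

[optimizer]
optimizer = "AdamW"
learning_rate = 1e-4

[scheduler]
scheduler_type = "ConstantLR"

[dropout]
[dataset]
[checkpointing]
"""

config = BaseConfig(**tomli.loads(toml_text))
```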
|
202
src/refiners/training_utils/dropout.py
Normal file
|
@ -0,0 +1,202 @@
|
|||
from typing import TYPE_CHECKING, Any, TypeVar
|
||||
|
||||
from torch import Tensor, randint, cat, rand
|
||||
from torch.nn import Dropout as TorchDropout
|
||||
|
||||
import refiners.fluxion.layers as fl
|
||||
from refiners.training_utils.callback import Callback
|
||||
from refiners.adapters.adapter import Adapter
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from refiners.training_utils.config import BaseConfig
|
||||
from refiners.training_utils.trainer import Trainer
|
||||
|
||||
|
||||
__all__ = ["Dropout", "GyroDropout", "DropoutCallback"]
|
||||
|
||||
|
||||
class Dropout(TorchDropout, fl.Module):
|
||||
def __init__(self, probability: float = 0.5, inplace: bool = False) -> None:
|
||||
super().__init__(p=probability, inplace=inplace)
|
||||
|
||||
|
||||
class GyroDropout(fl.Module):
|
||||
"""
|
||||
GyroDropout is a variant of dropout that maximizes the ensemble effect during neural network training.
|
||||
It pre-selects a fixed number of dropout masks and periodically selects a subset of them for training.
|
||||
This leads to increased robustness and diversity among the subnetworks, improving accuracy compared to conventional
|
||||
dropout.
|
||||
|
||||
Parameters:
|
||||
-----------
|
||||
total_subnetworks:
|
||||
The total number of pre-selected subnetworks ('Sigma'). These subnetworks are dropout masks
|
||||
that are precomputed and stored.
|
||||
|
||||
concurrent_subnetworks:
|
||||
The number of subnetworks to use concurrently in each forward pass ('Tau'). A random selection of
|
||||
masks from the precomputed set is used to dropout different portions of the input.
|
||||
|
||||
dropout_probability: float, optional (default=0.5)
|
||||
The probability that an element will be zeroed by the dropout.
|
||||
|
||||
iters_per_epoch:
|
||||
Number of iterations per epoch, used to determine how often the masks should be updated.
|
||||
|
||||
num_features_threshold:
|
||||
If the number of features in the input is greater than this threshold, GyroDropout falls back to a plain
Dropout layer, because the VRAM used by the precomputed masks grows with the number of features in the input.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
total_subnetworks: int,
|
||||
concurrent_subnetworks: int,
|
||||
dropout_probability: float = 0.5,
|
||||
iters_per_epoch: int = 1,
|
||||
num_features_threshold: float = 5e5,
|
||||
) -> None:
|
||||
super().__init__()
|
||||
assert (
|
||||
iters_per_epoch >= total_subnetworks
|
||||
), "The number of iterations per epoch must be greater than the number of masks"
|
||||
self.dropout_probability = dropout_probability
|
||||
self.iters_per_epoch = iters_per_epoch
|
||||
self.total_subnetworks = total_subnetworks
|
||||
self.concurrent_subnetworks = concurrent_subnetworks
|
||||
self.scale = 1 / (1 - self.dropout_probability)
|
||||
self.mask_update_interval = int(self.iters_per_epoch / self.total_subnetworks) * self.concurrent_subnetworks
|
||||
self.preselected_masks: Tensor | None = None
|
||||
self.dropout_mask = None
|
||||
self.training_step = 0
|
||||
self.num_features_threshold = num_features_threshold
|
||||
self.skip_high_num_features = False
|
||||
|
||||
def forward(self, x: Tensor) -> Tensor:
|
||||
if not self.training:
|
||||
return x
|
||||
if self.skip_high_num_features:
|
||||
return self.basic_dropout(x)
|
||||
if self.training_step == 0:
|
||||
num_features = x.shape[1] * x.shape[2] if x.dim() == 3 else x.shape[1]
|
||||
if num_features > self.num_features_threshold:
|
||||
self.skip_high_num_features = True
|
||||
self.basic_dropout = Dropout(probability=self.dropout_probability)
|
||||
return self.basic_dropout(x)
|
||||
self.init_masks(x=x)
|
||||
|
||||
if self.training_step % self.mask_update_interval == 0:
|
||||
self.update_dropout_mask(x=x)
|
||||
|
||||
self.training_step += 1
|
||||
|
||||
return x * self.dropout_mask * self.scale
|
||||
|
||||
def init_masks(self, x: Tensor) -> None:
|
||||
if x.dim() == 2:
|
||||
self.preselected_masks = (
|
||||
rand(self.total_subnetworks, x.shape[1], device=x.device) > self.dropout_probability
|
||||
)
|
||||
if x.dim() == 3:
|
||||
self.preselected_masks = (
|
||||
rand(self.total_subnetworks, x.shape[1], x.shape[2], device=x.device) > self.dropout_probability
|
||||
)
|
||||
|
||||
assert self.preselected_masks is not None, "The input tensor must have 2 or 3 dimensions"
|
||||
self.preselected_masks = self.preselected_masks.float()
|
||||
|
||||
def update_dropout_mask(self, x: Tensor) -> None:
|
||||
assert self.preselected_masks is not None
|
||||
indices = randint(low=0, high=self.total_subnetworks, size=(self.concurrent_subnetworks,), device=x.device)
|
||||
selected_masks = self.preselected_masks[indices]
|
||||
|
||||
repeat_factor = x.shape[0] // self.concurrent_subnetworks
|
||||
remaining = x.shape[0] % self.concurrent_subnetworks
|
||||
repeated_masks = [selected_masks] * repeat_factor
|
||||
if remaining > 0:
|
||||
repeated_masks.append(selected_masks[:remaining])
|
||||
final_masks = cat(tensors=repeated_masks, dim=0)
|
||||
|
||||
if x.dim() == 2:
|
||||
self.dropout_mask = final_masks
|
||||
if x.dim() == 3:
|
||||
self.dropout_mask = final_masks.expand_as(x)
|
||||
|
||||
|
||||
class DropoutAdapter(fl.Chain, Adapter[fl.Linear]):
|
||||
def __init__(self, target: fl.Linear, probability: float = 0.5):
|
||||
with self.setup_adapter(target):
|
||||
super().__init__(target, Dropout(probability=probability))
|
||||
|
||||
|
||||
class GyroDropoutAdapter(fl.Chain, Adapter[fl.Linear]):
|
||||
def __init__(
|
||||
self,
|
||||
target: fl.Linear,
|
||||
probability: float = 0.5,
|
||||
total_subnetworks: int = 512,
|
||||
concurrent_subnetworks: int = 64,
|
||||
iters_per_epoch: int = 512,
|
||||
num_features_threshold: float = 5e5,
|
||||
) -> None:
|
||||
self.probability = probability
|
||||
self.total_subnetworks = total_subnetworks
|
||||
self.concurrent_subnetworks = concurrent_subnetworks
|
||||
self.iters_per_epoch = iters_per_epoch
|
||||
|
||||
with self.setup_adapter(target):
|
||||
super().__init__(
|
||||
target,
|
||||
GyroDropout(
|
||||
total_subnetworks=total_subnetworks,
|
||||
concurrent_subnetworks=concurrent_subnetworks,
|
||||
dropout_probability=probability,
|
||||
iters_per_epoch=iters_per_epoch,
|
||||
num_features_threshold=num_features_threshold,
|
||||
),
|
||||
)
|
||||
|
||||
|
||||
def apply_dropout(module: fl.Chain, probability: float = 0.5) -> None:
|
||||
for linear, parent in module.walk(fl.Linear):
|
||||
if not linear.weight.requires_grad:
|
||||
continue
|
||||
assert not (
|
||||
isinstance(parent, Dropout) or isinstance(parent, GyroDropout)
|
||||
), f"{linear} already has a dropout layer"
|
||||
adapter = DropoutAdapter(target=linear, probability=probability)
|
||||
adapter.inject(parent)
|
||||
|
||||
|
||||
def apply_gyro_dropout(
|
||||
module: fl.Chain,
|
||||
probability: float = 0.5,
|
||||
total_subnetworks: int = 32,
|
||||
concurrent_subnetworks: int = 16,
|
||||
iters_per_epoch: int = 32,
num_features_threshold: float = 5e5,
|
||||
) -> None:
|
||||
for linear, parent in module.walk(fl.Linear):
|
||||
if not linear.weight.requires_grad:
|
||||
continue
|
||||
assert not (
|
||||
isinstance(parent, Dropout) or isinstance(parent, GyroDropout)
|
||||
), f"{linear} already has a dropout layer"
|
||||
adapter = GyroDropoutAdapter(
|
||||
target=linear,
|
||||
probability=probability,
|
||||
total_subnetworks=total_subnetworks,
|
||||
concurrent_subnetworks=concurrent_subnetworks,
|
||||
iters_per_epoch=iters_per_epoch,
num_features_threshold=num_features_threshold,
|
||||
)
|
||||
adapter.inject(parent)
|
||||
|
||||
|
||||
ConfigType = TypeVar("ConfigType", bound="BaseConfig")
|
||||
|
||||
|
||||
class DropoutCallback(Callback["Trainer[ConfigType, Any]"]):
|
||||
def on_train_begin(self, trainer: "Trainer[ConfigType, Any]") -> None:
|
||||
dropout_config = trainer.config.dropout
|
||||
chain_models = [model for model in trainer.models.values() if isinstance(model, fl.Chain)]
|
||||
for model in chain_models:
|
||||
dropout_config.apply_dropout(model=model)
|
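For a sense of scale, with the `GyroDropoutConfig` defaults above (`total_subnetworks=512`, `concurrent_subnetworks=64`, `iters_per_epoch=512`), `mask_update_interval` works out to `int(512 / 512) * 64 = 64`, so a fresh random subset of 64 precomputed masks is drawn every 64 training steps. A small sketch of how the adapters are wired into a model; the two-layer chain is made up for illustration:

```python
import refiners.fluxion.layers as fl
from refiners.training_utils.dropout import apply_dropout, apply_gyro_dropout

# Toy model, for illustration only.
model = fl.Chain(
    fl.Linear(in_features=16, out_features=16),
    fl.Linear(in_features=16, out_features=4),
)

# Wraps every trainable Linear in a DropoutAdapter, i.e. Chain(Linear, Dropout(p=0.1)).
apply_dropout(module=model, probability=0.1)

# Alternatively, use GyroDropout adapters instead (pick one flavor per chain):
# apply_gyro_dropout(module=model, probability=0.1, total_subnetworks=32, concurrent_subnetworks=16, iters_per_epoch=32)
```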
23
src/refiners/training_utils/huggingface_datasets.py
Normal file
|
@ -0,0 +1,23 @@
|
from datasets import load_dataset as _load_dataset, VerificationMode  # type: ignore
from typing import Any, Generic, Protocol, TypeVar, cast

__all__ = ["load_hf_dataset", "HuggingfaceDataset"]


T = TypeVar("T", covariant=True)


class HuggingfaceDataset(Generic[T], Protocol):
    def __getitem__(self, index: int) -> T:
        ...

    def __len__(self) -> int:
        ...


def load_hf_dataset(
    path: str, revision: str = "main", split: str = "train", use_verification: bool = False
) -> HuggingfaceDataset[Any]:
    verification_mode = VerificationMode.BASIC_CHECKS if use_verification else VerificationMode.NO_CHECKS
    dataset = _load_dataset(path=path, revision=revision, split=split, verification_mode=verification_mode)
    return cast(HuggingfaceDataset[Any], dataset)
|
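`load_hf_dataset` is a thin, typed wrapper around `datasets.load_dataset`: the `HuggingfaceDataset` protocol only promises indexing and length, which is all the dataset classes below rely on. A short usage sketch (the repo name mirrors the `HuggingfaceDatasetConfig` default; the row keys are assumptions about that particular dataset):

```python
from refiners.training_utils.huggingface_datasets import load_hf_dataset

dataset = load_hf_dataset(path="finegrain/unsplash-dummy", revision="main", split="train")
print(len(dataset))  # number of rows in the split
item = dataset[0]    # a row, e.g. with "image" and "caption" fields
```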
238
src/refiners/training_utils/latent_diffusion.py
Normal file
|
@ -0,0 +1,238 @@
|
|||
from dataclasses import dataclass
|
||||
from typing import Any, TypeVar, TypedDict, cast
|
||||
from pydantic import BaseModel
|
||||
from refiners.foundationals.latent_diffusion.schedulers.ddpm import DDPM
|
||||
from torch import device as Device, Tensor, randn, dtype as DType, Generator, cat
|
||||
from loguru import logger
|
||||
from torch.utils.data import Dataset
|
||||
from refiners.foundationals.latent_diffusion.unet import UNet
|
||||
from refiners.foundationals.clip.text_encoder import CLIPTextEncoderL
|
||||
from refiners.foundationals.latent_diffusion.auto_encoder import LatentDiffusionAutoencoder
|
||||
from torchvision.transforms import RandomCrop # type: ignore
|
||||
import refiners.fluxion.layers as fl
|
||||
from PIL import Image
|
||||
from functools import cached_property
|
||||
from refiners.training_utils.config import BaseConfig
|
||||
from refiners.foundationals.latent_diffusion import StableDiffusion_1
|
||||
from refiners.foundationals.latent_diffusion.schedulers import DPMSolver
|
||||
from torch.nn.functional import mse_loss
|
||||
import random
|
||||
from refiners.training_utils.wandb import WandbLoggable
|
||||
from refiners.training_utils.trainer import Trainer
|
||||
from refiners.training_utils.callback import Callback
|
||||
from refiners.training_utils.huggingface_datasets import load_hf_dataset, HuggingfaceDataset
|
||||
|
||||
|
||||
class LatentDiffusionConfig(BaseModel):
|
||||
unconditional_sampling_probability: float = 0.2
|
||||
offset_noise: float = 0.1
|
||||
min_timestep: int = 0
|
||||
max_timestep: int = 999
|
||||
|
||||
|
||||
class TestDiffusionConfig(BaseModel):
|
||||
seed: int = 0
|
||||
num_inference_steps: int = 30
|
||||
use_short_prompts: bool = False
|
||||
prompts: list[str] = []
|
||||
num_images_per_prompt: int = 1
|
||||
|
||||
|
||||
class FinetuneLatentDiffusionConfig(BaseConfig):
|
||||
latent_diffusion: LatentDiffusionConfig
|
||||
test_diffusion: TestDiffusionConfig
|
||||
|
||||
|
||||
@dataclass
|
||||
class TextEmbeddingLatentsBatch:
|
||||
text_embeddings: Tensor
|
||||
latents: Tensor
|
||||
|
||||
|
||||
class CaptionImage(TypedDict):
|
||||
caption: str
|
||||
image: Image.Image
|
||||
|
||||
|
||||
ConfigType = TypeVar("ConfigType", bound=FinetuneLatentDiffusionConfig)
|
||||
|
||||
|
||||
class TextEmbeddingLatentsDataset(Dataset[TextEmbeddingLatentsBatch]):
|
||||
def __init__(self, trainer: "LatentDiffusionTrainer[Any]") -> None:
|
||||
self.trainer = trainer
|
||||
self.config = trainer.config
|
||||
self.device = self.trainer.device
|
||||
self.lda = self.trainer.lda
|
||||
self.text_encoder = self.trainer.text_encoder
|
||||
self.dataset = self.load_huggingface_dataset()
|
||||
self.process_image = RandomCrop(size=512) # TODO: make this configurable and add other transforms
|
||||
logger.info(f"Loaded {len(self.dataset)} samples from dataset")
|
||||
|
||||
def load_huggingface_dataset(self) -> HuggingfaceDataset[CaptionImage]:
|
||||
dataset_config = self.config.dataset
|
||||
logger.info(f"Loading dataset from {dataset_config.hf_repo} revision {dataset_config.revision}")
|
||||
return cast(
|
||||
HuggingfaceDataset[CaptionImage],
|
||||
load_hf_dataset(path=dataset_config.hf_repo, revision=dataset_config.revision, split=dataset_config.split),
|
||||
)
|
||||
|
||||
def resize_image(self, image: Image.Image, min_size: int = 512, max_size: int = 576) -> Image.Image:
|
||||
return resize_image(image=image, min_size=min_size, max_size=max_size)
|
||||
|
||||
def process_caption(self, caption: str) -> str:
|
||||
return caption if random.random() > self.config.latent_diffusion.unconditional_sampling_probability else ""
|
||||
|
||||
def __getitem__(self, index: int) -> TextEmbeddingLatentsBatch:
|
||||
item = self.dataset[index]
|
||||
caption, image = item["caption"], item["image"]
|
||||
resized_image = self.resize_image(image=image)
|
||||
processed_image: Image.Image = self.process_image(resized_image)
|
||||
latents = self.lda.encode_image(image=processed_image).to(device=self.device)
|
||||
processed_caption = self.process_caption(caption=caption)
|
||||
clip_text_embedding = self.text_encoder.encode(text=processed_caption).to(device=self.device)
|
||||
return TextEmbeddingLatentsBatch(text_embeddings=clip_text_embedding, latents=latents)
|
||||
|
||||
def collate_fn(self, batch: list[TextEmbeddingLatentsBatch]) -> TextEmbeddingLatentsBatch:
|
||||
text_embeddings = cat(tensors=[item.text_embeddings for item in batch])
|
||||
latents = cat(tensors=[item.latents for item in batch])
|
||||
return TextEmbeddingLatentsBatch(text_embeddings=text_embeddings, latents=latents)
|
||||
|
||||
def __len__(self) -> int:
|
||||
return len(self.dataset)
|
||||
|
||||
|
||||
class LatentDiffusionTrainer(Trainer[ConfigType, TextEmbeddingLatentsBatch]):
|
||||
@cached_property
|
||||
def unet(self) -> UNet:
|
||||
assert self.config.models["unet"] is not None, "The config must contain a unet entry."
|
||||
return UNet(in_channels=4, clip_embedding_dim=768, device=self.device).to(device=self.device)
|
||||
|
||||
@cached_property
|
||||
def text_encoder(self) -> CLIPTextEncoderL:
|
||||
assert self.config.models["text_encoder"] is not None, "The config must contain a text_encoder entry."
|
||||
return CLIPTextEncoderL(device=self.device).to(device=self.device)
|
||||
|
||||
@cached_property
|
||||
def lda(self) -> LatentDiffusionAutoencoder:
|
||||
assert self.config.models["lda"] is not None, "The config must contain a lda entry."
|
||||
return LatentDiffusionAutoencoder(device=self.device).to(device=self.device)
|
||||
|
||||
def load_models(self) -> dict[str, fl.Module]:
|
||||
return {"unet": self.unet, "text_encoder": self.text_encoder, "lda": self.lda}
|
||||
|
||||
def load_dataset(self) -> Dataset[TextEmbeddingLatentsBatch]:
|
||||
return TextEmbeddingLatentsDataset(trainer=self)
|
||||
|
||||
@cached_property
|
||||
def ddpm_scheduler(self) -> DDPM:
|
||||
return DDPM(
|
||||
num_inference_steps=1000,
|
||||
device=self.device,
|
||||
).to(device=self.device)
|
||||
|
||||
def sample_timestep(self) -> Tensor:
|
||||
random_step = random.randint(
|
||||
a=self.config.latent_diffusion.min_timestep, b=self.config.latent_diffusion.max_timestep
|
||||
)
|
||||
self.current_step = random_step
|
||||
return self.ddpm_scheduler.timesteps[random_step].unsqueeze(dim=0)
|
||||
|
||||
def sample_noise(self, size: tuple[int, int, int, int], dtype: DType | None = None) -> Tensor:
|
||||
return sample_noise(
|
||||
size=size, offset_noise=self.config.latent_diffusion.offset_noise, device=self.device, dtype=dtype
|
||||
)
|
||||
|
||||
def compute_loss(self, batch: TextEmbeddingLatentsBatch) -> Tensor:
|
||||
clip_text_embedding, latents = batch.text_embeddings, batch.latents
|
||||
timestep = self.sample_timestep()
|
||||
noise = self.sample_noise(size=latents.shape, dtype=latents.dtype)
|
||||
noisy_latents = self.ddpm_scheduler.add_noise(x=latents, noise=noise, step=self.current_step)
|
||||
self.unet.set_timestep(timestep=timestep)
|
||||
self.unet.set_clip_text_embedding(clip_text_embedding=clip_text_embedding)
|
||||
prediction = self.unet(noisy_latents)
|
||||
loss = mse_loss(input=prediction, target=noise)
|
||||
return loss
|
||||
|
||||
def compute_evaluation(self) -> None:
|
||||
sd = StableDiffusion_1(
|
||||
unet=self.unet,
|
||||
lda=self.lda,
|
||||
clip_text_encoder=self.text_encoder,
|
||||
scheduler=DPMSolver(num_inference_steps=self.config.test_diffusion.num_inference_steps),
|
||||
device=self.device,
|
||||
)
|
||||
prompts = self.config.test_diffusion.prompts
|
||||
num_images_per_prompt = self.config.test_diffusion.num_images_per_prompt
|
||||
if self.config.test_diffusion.use_short_prompts:
|
||||
prompts = [prompt.split(sep=",")[0] for prompt in prompts]
|
||||
images: dict[str, WandbLoggable] = {}
|
||||
for prompt in prompts:
|
||||
canvas_image: Image.Image = Image.new(mode="RGB", size=(512, 512 * num_images_per_prompt))
|
||||
for i in range(num_images_per_prompt):
|
||||
logger.info(f"Generating image {i+1}/{num_images_per_prompt} for prompt: {prompt}")
|
||||
x = randn(1, 4, 64, 64, device=self.device)
|
||||
clip_text_embedding = sd.compute_text_embedding(text=prompt).to(device=self.device)
|
||||
negative_clip_text_embedding = sd.compute_text_embedding(text="").to(device=self.device)
|
||||
for step in sd.steps:
|
||||
x = sd(
|
||||
x,
|
||||
step=step,
|
||||
clip_text_embedding=clip_text_embedding,
|
||||
negative_clip_text_embedding=negative_clip_text_embedding,
|
||||
)
|
||||
canvas_image.paste(sd.lda.decode_latents(x=x), box=(0, 512 * i))
|
||||
images[prompt] = canvas_image
|
||||
self.log(data=images)
|
||||
|
||||
|
||||
def sample_noise(
|
||||
size: tuple[int, int, int, int],
|
||||
offset_noise: float = 0.1,
|
||||
device: Device | str = "cpu",
|
||||
dtype: DType | None = None,
|
||||
generator: Generator | None = None,
|
||||
) -> Tensor:
|
||||
"""Sample noise from a normal distribution.
|
||||
|
||||
If `offset_noise` is greater than 0, a small per-channel offset is added to the noise. This allows the model to
generate images with a wider range of contrast: https://www.crosslabs.org/blog/diffusion-with-offset-noise.
|
||||
"""
|
||||
device = Device(device)
|
||||
noise = randn(*size, generator=generator, device=device, dtype=dtype)
|
||||
return noise + offset_noise * randn(*size[:2], 1, 1, generator=generator, device=device, dtype=dtype)
|
||||
|
||||
|
||||
def resize_image(image: Image.Image, min_size: int = 512, max_size: int = 576) -> Image.Image:
|
||||
image_min_size = min(image.size)
|
||||
if image_min_size > max_size:
|
||||
if image_min_size == image.size[0]:
|
||||
image = image.resize(size=(max_size, int(max_size * image.size[1] / image.size[0])))
|
||||
else:
|
||||
image = image.resize(size=(int(max_size * image.size[0] / image.size[1]), max_size))
|
||||
if image_min_size < min_size:
|
||||
if image_min_size == image.size[0]:
|
||||
image = image.resize(size=(min_size, int(min_size * image.size[1] / image.size[0])))
|
||||
else:
|
||||
image = image.resize(size=(int(min_size * image.size[0] / image.size[1]), min_size))
|
||||
return image
|
||||
|
||||
|
||||
class MonitorTimestepLoss(Callback[LatentDiffusionTrainer[Any]]):
|
||||
def on_train_begin(self, trainer: LatentDiffusionTrainer[Any]) -> None:
|
||||
self.timestep_bins: dict[int, list[float]] = {i: [] for i in range(10)}
|
||||
|
||||
def on_compute_loss_end(self, trainer: LatentDiffusionTrainer[Any]) -> None:
|
||||
loss_value = trainer.loss.detach().cpu().item()
|
||||
current_step = trainer.current_step
|
||||
bin_index = min(current_step // 100, 9)
|
||||
self.timestep_bins[bin_index].append(loss_value)
|
||||
|
||||
def on_epoch_end(self, trainer: LatentDiffusionTrainer[Any]) -> None:
|
||||
log_data = {}
|
||||
for bin_index, losses in self.timestep_bins.items():
|
||||
if losses:
|
||||
avg_loss = sum(losses) / len(losses)
|
||||
log_data[f"average_loss_timestep_bin_{bin_index * 100}"] = avg_loss
|
||||
self.timestep_bins[bin_index] = []
|
||||
|
||||
trainer.log(data=log_data)
|
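The offset-noise trick in `sample_noise` adds one extra Gaussian scalar per (sample, channel) pair on top of the usual per-pixel noise, which is what lets a fine-tuned model shift the overall brightness of an image more freely. A minimal shape check, using the same `1x4x64x64` latent shape as the trainer above:

```python
from refiners.training_utils.latent_diffusion import sample_noise

noise = sample_noise(size=(1, 4, 64, 64), offset_noise=0.1)
assert noise.shape == (1, 4, 64, 64)
# The offset term has shape (1, 4, 1, 1): one scalar per channel,
# broadcast over the whole 64x64 spatial grid.
```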
546
src/refiners/training_utils/trainer.py
Normal file
|
@ -0,0 +1,546 @@
|
|||
from functools import cached_property, wraps
|
||||
from pathlib import Path
|
||||
import random
|
||||
import time
|
||||
import numpy as np
|
||||
from torch import device as Device, Tensor, get_rng_state, no_grad, set_rng_state, cuda, stack
|
||||
from torch.nn import Parameter
|
||||
from torch.optim import Optimizer
|
||||
from torch.utils.data import DataLoader, Dataset
|
||||
from torch.autograd import backward
|
||||
from typing import Any, Callable, Generic, Iterable, TypeVar, cast
|
||||
from loguru import logger
|
||||
from refiners.fluxion import layers as fl
|
||||
from refiners.fluxion.utils import manual_seed
|
||||
from refiners.training_utils.wandb import WandbLogger, WandbLoggable
|
||||
from refiners.training_utils.config import BaseConfig, TimeUnit, TimeValue, SchedulerType
|
||||
from refiners.training_utils.dropout import DropoutCallback
|
||||
from refiners.training_utils.callback import (
|
||||
Callback,
|
||||
ClockCallback,
|
||||
GradientNormClipping,
|
||||
GradientValueClipping,
|
||||
GradientNormLogging,
|
||||
MonitorLoss,
|
||||
)
|
||||
from torch.optim.lr_scheduler import (
|
||||
StepLR,
|
||||
ExponentialLR,
|
||||
ReduceLROnPlateau,
|
||||
CosineAnnealingLR,
|
||||
LambdaLR,
|
||||
OneCycleLR,
|
||||
LRScheduler,
|
||||
MultiplicativeLR,
|
||||
CosineAnnealingWarmRestarts,
|
||||
CyclicLR,
|
||||
MultiStepLR,
|
||||
)
|
||||
|
||||
__all__ = ["seed_everything", "scoped_seed", "Trainer"]
|
||||
|
||||
|
||||
def count_learnable_parameters(parameters: Iterable[Parameter]) -> int:
|
||||
return sum(p.numel() for p in parameters if p.requires_grad)
|
||||
|
||||
|
||||
def human_readable_number(number: int) -> str:
|
||||
float_number = float(number)
|
||||
for unit in ["", "K", "M", "G", "T", "P"]:
|
||||
if abs(float_number) < 1000:
|
||||
return f"{float_number:.1f}{unit}"
|
||||
float_number /= 1000
|
||||
return f"{float_number:.1f}E"
|
||||
|
||||
|
||||
def seed_everything(seed: int | None = None) -> None:
|
||||
if seed is None:
|
||||
seed = random.randint(0, 2**32 - 1)
|
||||
logger.info(f"Using random seed: {seed}")
|
||||
random.seed(a=seed)
|
||||
np.random.seed(seed=seed)
|
||||
manual_seed(seed=seed)
|
||||
cuda.manual_seed_all(seed=seed)
|
||||
|
||||
|
||||
def scoped_seed(seed: int | Callable[..., int] | None = None) -> Callable[..., Callable[..., Any]]:
|
||||
"""
|
||||
Decorator for setting a random seed within the scope of a function.
|
||||
|
||||
This decorator sets the random seed for Python's built-in `random` module,
|
||||
`numpy`, `torch`, and `torch.cuda` at the beginning of the decorated function. After the
|
||||
function is executed, it restores the state of the random number generators
|
||||
to what it was before the function was called. This is useful for ensuring
|
||||
reproducibility for specific parts of the code without affecting randomness
|
||||
elsewhere.
|
||||
"""
|
||||
|
||||
def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
|
||||
@wraps(func)
|
||||
def inner_wrapper(*args: Any, **kwargs: Any) -> Any:
|
||||
random_state = random.getstate()
|
||||
numpy_state = np.random.get_state()
|
||||
torch_state = get_rng_state()
|
||||
cuda_torch_state = cuda.get_rng_state()
|
||||
actual_seed = seed(*args) if callable(seed) else seed
|
||||
seed_everything(seed=actual_seed)
|
||||
result = func(*args, **kwargs)
|
||||
random.setstate(random_state)
|
||||
np.random.set_state(numpy_state)
|
||||
set_rng_state(torch_state)
|
||||
cuda.set_rng_state(cuda_torch_state)
|
||||
return result
|
||||
|
||||
return inner_wrapper
|
||||
|
||||
return decorator
|
||||
|
||||
|
||||
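# Usage sketch (illustrative, not part of the original module): `scoped_seed`
# accepts either a fixed integer or a callable deriving the seed from the
# decorated function's arguments, which is how `Trainer.train` and
# `Trainer.evaluate` below bind it to the seeds in the training config.
#
#     @scoped_seed(seed=42)
#     def sample_indices(n: int) -> list[int]:
#         return [random.randint(0, 100) for _ in range(n)]
#
# Repeated calls return the same values, and the surrounding RNG state
# (random, numpy, torch, torch.cuda) is restored once the call returns.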
class WarmupScheduler(LRScheduler):
|
||||
_step_count: int # defined by LRScheduler
|
||||
|
||||
def __init__(self, optimizer: Optimizer, scheduler: LRScheduler, warmup_steps: int = 0) -> None:
|
||||
self.warmup_steps = warmup_steps
|
||||
self.scheduler = scheduler
|
||||
super().__init__(optimizer=optimizer)
|
||||
|
||||
def get_lr(self) -> list[float] | float: # type: ignore
|
||||
if self._step_count < self.warmup_steps:
|
||||
return [base_lr * self._step_count / self.warmup_steps for base_lr in self.base_lrs]
|
||||
return self.scheduler.get_lr()
|
||||
|
||||
def step(self, epoch: int | None = None) -> None:
|
||||
if self._step_count < self.warmup_steps:
|
||||
super().step()
|
||||
else:
|
||||
self.scheduler.step(epoch=epoch)
|
||||
self._step_count += 1
|
||||
|
||||
|
||||
class TrainingClock:
|
||||
def __init__(
|
||||
self,
|
||||
dataset_length: int,
|
||||
batch_size: int,
|
||||
training_duration: TimeValue,
|
||||
gradient_accumulation: TimeValue,
|
||||
evaluation_interval: TimeValue,
|
||||
lr_scheduler_interval: TimeValue,
|
||||
checkpointing_save_interval: TimeValue,
|
||||
) -> None:
|
||||
self.dataset_length = dataset_length
|
||||
self.batch_size = batch_size
|
||||
self.training_duration = training_duration
|
||||
self.gradient_accumulation = gradient_accumulation
|
||||
self.evaluation_interval = evaluation_interval
|
||||
self.lr_scheduler_interval = lr_scheduler_interval
|
||||
self.checkpointing_save_interval = checkpointing_save_interval
|
||||
self.num_batches_per_epoch = dataset_length // batch_size
|
||||
self.start_time = None
|
||||
self.end_time = None
|
||||
self.step = 0
|
||||
self.epoch = 0
|
||||
self.iteration = 0
|
||||
self.num_batches_processed = 0
|
||||
self.num_minibatches_processed = 0
|
||||
self.loss: Tensor | None = None
|
||||
|
||||
@cached_property
|
||||
def unit_to_steps(self) -> dict[TimeUnit, int]:
|
||||
return {
|
||||
TimeUnit.STEP: 1,
|
||||
TimeUnit.EPOCH: self.num_batches_per_epoch,
|
||||
TimeUnit.ITERATION: self.gradient_accumulation["number"] * {
|
||||
TimeUnit.STEP: 1,
|
||||
TimeUnit.EPOCH: self.num_batches_per_epoch,
|
||||
}.get(self.gradient_accumulation["unit"], 1),
|
||||
}
|
||||
|
||||
def convert_time_unit_to_steps(self, number: int, unit: TimeUnit) -> int:
|
||||
return number * self.unit_to_steps[unit]
|
||||
|
||||
def convert_steps_to_time_unit(self, steps: int, unit: TimeUnit) -> int:
|
||||
return steps // self.unit_to_steps[unit]
|
||||
|
||||
def convert_time_value(self, time_value: TimeValue, target_unit: TimeUnit) -> int:
|
||||
number, unit = time_value["number"], time_value["unit"]
|
||||
steps = self.convert_time_unit_to_steps(number=number, unit=unit)
|
||||
return self.convert_steps_to_time_unit(steps=steps, unit=target_unit)
|
||||
|
||||
@cached_property
|
||||
def num_epochs(self) -> int:
|
||||
return self.convert_time_value(time_value=self.training_duration, target_unit=TimeUnit.EPOCH)
|
||||
|
||||
@cached_property
|
||||
def num_iterations(self) -> int:
|
||||
return self.convert_time_value(time_value=self.training_duration, target_unit=TimeUnit.ITERATION)
|
||||
|
||||
@cached_property
|
||||
def num_steps(self) -> int:
|
||||
return self.convert_time_value(time_value=self.training_duration, target_unit=TimeUnit.STEP)
|
||||
|
||||
@cached_property
|
||||
def num_step_per_iteration(self) -> int:
|
||||
return self.convert_time_unit_to_steps(
|
||||
number=self.gradient_accumulation["number"], unit=self.gradient_accumulation["unit"]
|
||||
)
|
||||
|
||||
@cached_property
|
||||
def num_step_per_evaluation(self) -> int:
|
||||
return self.convert_time_unit_to_steps(
|
||||
number=self.evaluation_interval["number"], unit=self.evaluation_interval["unit"]
|
||||
)
|
||||
|
||||
def reset(self) -> None:
|
||||
self.start_time = None
|
||||
self.end_time = None
|
||||
self.step = 0
|
||||
self.epoch = 0
|
||||
self.iteration = 0
|
||||
self.num_batches_processed = 0
|
||||
self.num_minibatches_processed = 0
|
||||
|
||||
def start_timer(self) -> None:
|
||||
self.start_time = time.time()
|
||||
|
||||
def stop_timer(self) -> None:
|
||||
self.end_time = time.time()
|
||||
|
||||
@property
|
||||
def time_elapsed(self) -> int:
|
||||
assert self.start_time is not None, "Timer has not been started yet."
|
||||
return int(time.time() - self.start_time)
|
||||
|
||||
@cached_property
|
||||
def evaluation_interval_steps(self) -> int:
|
||||
return self.convert_time_unit_to_steps(
|
||||
number=self.evaluation_interval["number"], unit=self.evaluation_interval["unit"]
|
||||
)
|
||||
|
||||
@cached_property
|
||||
def lr_scheduler_interval_steps(self) -> int:
|
||||
return self.convert_time_unit_to_steps(
|
||||
number=self.lr_scheduler_interval["number"], unit=self.lr_scheduler_interval["unit"]
|
||||
)
|
||||
|
||||
@cached_property
|
||||
def checkpointing_save_interval_steps(self) -> int:
|
||||
return self.convert_time_unit_to_steps(
|
||||
number=self.checkpointing_save_interval["number"], unit=self.checkpointing_save_interval["unit"]
|
||||
)
|
||||
|
||||
@property
|
||||
def is_optimizer_step(self) -> bool:
|
||||
return self.num_minibatches_processed == self.num_step_per_iteration
|
||||
|
||||
@property
|
||||
def is_lr_scheduler_step(self) -> bool:
|
||||
return self.step % self.lr_scheduler_interval_steps == 0
|
||||
|
||||
@property
|
||||
def done(self) -> bool:
|
||||
return self.step >= self.num_steps
|
||||
|
||||
@property
|
||||
def is_evaluation_step(self) -> bool:
|
||||
return self.step % self.evaluation_interval_steps == 0
|
||||
|
||||
@property
|
||||
def is_checkpointing_step(self) -> bool:
|
||||
return self.step % self.checkpointing_save_interval_steps == 0
|
||||
|
||||
|
||||
def compute_grad_norm(parameters: Iterable[Parameter]) -> float:
|
||||
"""
|
||||
Computes the total gradient norm of the given parameters, similar to the value returned by `clip_grad_norm_`.
|
||||
"""
|
||||
gradients: list[Tensor] = [p.grad.detach() for p in parameters if p.grad is not None]
|
||||
assert gradients, "The model has no gradients to compute the norm."
|
||||
total_norm = stack(tensors=[gradient.norm() for gradient in gradients]).norm().item() # type: ignore
|
||||
return total_norm # type: ignore
|
||||
|
||||
|
||||
Batch = TypeVar("Batch")
|
||||
ConfigType = TypeVar("ConfigType", bound=BaseConfig)
|
||||
|
||||
|
||||
class Trainer(Generic[ConfigType, Batch]):
|
||||
def __init__(self, config: ConfigType, callbacks: list[Callback[Any]] | None = None) -> None:
|
||||
self.config = config
|
||||
self.clock = TrainingClock(
|
||||
dataset_length=self.dataset_length,
|
||||
batch_size=config.training.batch_size,
|
||||
training_duration=config.training.duration,
|
||||
evaluation_interval=config.training.evaluation_interval,
|
||||
gradient_accumulation=config.training.gradient_accumulation,
|
||||
lr_scheduler_interval=config.scheduler.update_interval,
|
||||
checkpointing_save_interval=config.checkpointing.save_interval,
|
||||
)
|
||||
self.callbacks = callbacks or []
|
||||
self.callbacks += self.default_callbacks()
|
||||
self.load_wandb()
|
||||
self.load_models()
|
||||
self.prepare_models()
|
||||
self.prepare_checkpointing()
|
||||
|
||||
def default_callbacks(self) -> list[Callback[Any]]:
|
||||
return [
|
||||
ClockCallback(),
|
||||
MonitorLoss(),
|
||||
GradientNormLogging(),
|
||||
GradientValueClipping(),
|
||||
GradientNormClipping(),
|
||||
DropoutCallback(),
|
||||
]
|
||||
|
||||
@cached_property
|
||||
def device(self) -> Device:
|
||||
selected_device = Device(device=f"cuda:{self.config.training.gpu_index}")
|
||||
logger.info(f"Using device: {selected_device}")
|
||||
return selected_device
|
||||
|
||||
@property
|
||||
def parameters(self) -> list[Parameter]:
|
||||
"""Returns a list of all parameters in all models"""
|
||||
return [param for model in self.models.values() for param in model.parameters()]
|
||||
|
||||
@property
|
||||
def learnable_parameters(self) -> list[Parameter]:
|
||||
"""Returns a list of learnable parameters in all models"""
|
||||
return [param for model in self.models.values() for param in model.parameters() if param.requires_grad]
|
||||
|
||||
@property
|
||||
def learnable_parameter_count(self) -> int:
|
||||
"""Returns the number of learnable parameters in all models"""
|
||||
return count_learnable_parameters(parameters=self.learnable_parameters)
|
||||
|
||||
@property
|
||||
def gradients(self) -> list[Tensor]:
|
||||
"""Returns a list of detached gradients for all learnable parameters in all models"""
|
||||
return [
|
||||
param.grad.detach()
|
||||
for model in self.models.values()
|
||||
for param in model.parameters()
|
||||
if param.grad is not None
|
||||
]
|
||||
|
||||
@property
|
||||
def total_gradient_norm(self) -> float:
|
||||
"""Returns the total gradient norm for all learnable parameters in all models"""
|
||||
return compute_grad_norm(parameters=self.parameters)
|
||||
|
||||
@cached_property
|
||||
def optimizer(self) -> Optimizer:
|
||||
formatted_param_count = human_readable_number(number=self.learnable_parameter_count)
|
||||
logger.info(f"Total number of learnable parameters in the model(s): {formatted_param_count}")
|
||||
optimizer = self.config.optimizer.get(model_parameters=self.learnable_parameters)
|
||||
return optimizer
|
||||
|
||||
@cached_property
|
||||
def lr_scheduler(self) -> LRScheduler:
|
||||
config = self.config.scheduler
|
||||
step_size = self.clock.convert_time_unit_to_steps(
|
||||
number=config.update_interval["number"], unit=config.update_interval["unit"]
|
||||
)
|
||||
|
||||
match config.scheduler_type:
|
||||
case SchedulerType.CONSTANT_LR:
|
||||
lr_scheduler = LambdaLR(optimizer=self.optimizer, lr_lambda=lambda _: 1.0)
|
||||
case SchedulerType.STEP_LR:
|
||||
lr_scheduler = StepLR(optimizer=self.optimizer, step_size=step_size, gamma=config.gamma)
|
||||
case SchedulerType.EXPONENTIAL_LR:
|
||||
lr_scheduler = ExponentialLR(optimizer=self.optimizer, gamma=config.gamma)
|
||||
case SchedulerType.COSINE_ANNEALING_LR:
|
||||
lr_scheduler = CosineAnnealingLR(optimizer=self.optimizer, T_max=step_size, eta_min=config.eta_min)
|
||||
case SchedulerType.REDUCE_LR_ON_PLATEAU:
|
||||
lr_scheduler = cast(
|
||||
LRScheduler,
|
||||
ReduceLROnPlateau(
|
||||
optimizer=self.optimizer,
|
||||
mode=config.mode,
|
||||
factor=config.factor,
|
||||
patience=config.patience,
|
||||
threshold=config.threshold,
|
||||
cooldown=config.cooldown,
|
||||
min_lr=config.min_lr,
|
||||
),
|
||||
)
|
||||
case SchedulerType.LAMBDA_LR:
|
||||
assert config.lr_lambda is not None, "lr_lambda must be specified to use LambdaLR"
|
||||
lr_scheduler = LambdaLR(optimizer=self.optimizer, lr_lambda=config.lr_lambda)
|
||||
case SchedulerType.ONE_CYCLE_LR:
|
||||
lr_scheduler = OneCycleLR(optimizer=self.optimizer, max_lr=config.max_lr, total_steps=step_size)
|
||||
case SchedulerType.MULTIPLICATIVE_LR:
|
||||
assert config.lr_lambda is not None, "lr_lambda must be specified to use MultiplicativeLR"
|
||||
lr_scheduler = MultiplicativeLR(optimizer=self.optimizer, lr_lambda=config.lr_lambda)
|
||||
case SchedulerType.COSINE_ANNEALING_WARM_RESTARTS:
|
||||
lr_scheduler = CosineAnnealingWarmRestarts(optimizer=self.optimizer, T_0=step_size)
|
||||
case SchedulerType.CYCLIC_LR:
|
||||
lr_scheduler = CyclicLR(optimizer=self.optimizer, base_lr=config.base_lr, max_lr=config.max_lr)
|
||||
case SchedulerType.MULTI_STEP_LR:
|
||||
lr_scheduler = MultiStepLR(optimizer=self.optimizer, milestones=config.milestones, gamma=config.gamma)
|
||||
case _:
|
||||
raise ValueError(f"Unknown scheduler type: {config.scheduler_type}")
|
||||
|
||||
warmup_steps = self.clock.convert_time_unit_to_steps(number=config.warmup["number"], unit=config.warmup["unit"])
|
||||
if warmup_steps > 0:
|
||||
lr_scheduler = WarmupScheduler(
|
||||
optimizer=self.optimizer,
|
||||
scheduler=lr_scheduler,
|
||||
warmup_steps=warmup_steps,
|
||||
)
|
||||
|
||||
return lr_scheduler
|
||||
|
||||
@cached_property
|
||||
def models(self) -> dict[str, fl.Module]:
|
||||
return self.load_models()
|
||||
|
||||
def set_models_to_train_mode(self) -> None:
|
||||
for model in self.models.values():
|
||||
model.train()
|
||||
|
||||
def set_models_to_eval_mode(self) -> None:
|
||||
for model in self.models.values():
|
||||
model.eval()
|
||||
|
||||
def log(self, data: dict[str, WandbLoggable]) -> None:
|
||||
self.wandb.log(data=data, step=self.clock.step)
|
||||
|
||||
def load_wandb(self) -> None:
|
||||
init_config = {**self.config.wandb.model_dump(), "config": self.config.model_dump()}
|
||||
self.wandb = WandbLogger(init_config=init_config)
|
||||
|
||||
def prepare_model(self, model_name: str) -> None:
|
||||
model = self.models[model_name]
|
||||
if (checkpoint := self.config.models[model_name].checkpoint) is not None:
|
||||
model.load_from_safetensors(tensors_path=checkpoint)
|
||||
else:
|
||||
logger.info(f"No checkpoint found. Initializing model `{model_name}` from scratch.")
|
||||
model.requires_grad_(requires_grad=self.config.models[model_name].train)
|
||||
model.to(self.device)
|
||||
model.zero_grad()
|
||||
|
||||
def prepare_models(self) -> None:
|
||||
assert self.models, "No models found."
|
||||
for model_name in self.models:
|
||||
self.prepare_model(model_name=model_name)
|
||||
|
||||
def prepare_checkpointing(self) -> None:
|
||||
if self.config.checkpointing.save_folder is not None:
|
||||
assert self.config.checkpointing.save_folder.is_dir()
|
||||
self.checkpoints_save_folder = (
|
||||
self.config.checkpointing.save_folder / self.wandb.project_name / self.wandb.run_name
|
||||
)
|
||||
self.checkpoints_save_folder.mkdir(parents=True, exist_ok=False)
|
||||
logger.info(f"Checkpointing enabled: {self.checkpoints_save_folder}")
|
||||
else:
|
||||
self.checkpoints_save_folder = None
|
||||
logger.info("Checkpointing disabled: configure `save_folder` to turn it on.")
|
||||
|
||||
def load_models(self) -> dict[str, fl.Module]:
|
||||
raise NotImplementedError("The `load_models` method must be implemented in the subclass.")
|
||||
|
||||
def load_dataset(self) -> Dataset[Batch]:
|
||||
raise NotImplementedError("The `load_dataset` method must be implemented in the subclass.")
|
||||
|
||||
@cached_property
|
||||
def dataset(self) -> Dataset[Batch]:
|
||||
return self.load_dataset()
|
||||
|
||||
@cached_property
|
||||
def dataset_length(self) -> int:
|
||||
assert hasattr(self.dataset, "__len__"), "The dataset must implement the `__len__` method."
|
||||
return len(self.dataset) # type: ignore
|
||||
|
||||
@cached_property
|
||||
def dataloader(self) -> DataLoader[Batch]:
|
||||
collate_fn = getattr(self.dataset, "collate_fn", None)
|
||||
return DataLoader(
|
||||
dataset=self.dataset, batch_size=self.config.training.batch_size, shuffle=True, collate_fn=collate_fn
|
||||
)
|
||||
|
||||
@property
|
||||
def checkpointing_enabled(self) -> bool:
|
||||
return self.checkpoints_save_folder is not None
|
||||
|
||||
@property
|
||||
def ensure_checkpoints_save_folder(self) -> Path:
|
||||
assert self.checkpoints_save_folder is not None
|
||||
return self.checkpoints_save_folder
|
||||
|
||||
def compute_loss(self, batch: Batch) -> Tensor:
|
||||
raise NotImplementedError("The `compute_loss` method must be implemented in the subclass.")
|
||||
|
||||
def compute_evaluation(self) -> None:
|
||||
pass
|
||||
|
||||
def backward(self) -> None:
|
||||
"""Backward pass on the loss."""
|
||||
self._call_callbacks(event_name="on_backward_begin")
|
||||
scaled_loss = self.loss / self.clock.num_step_per_iteration
|
||||
backward(tensors=scaled_loss)
|
||||
self._call_callbacks(event_name="on_backward_end")
|
||||
if self.clock.is_optimizer_step:
|
||||
self._call_callbacks(event_name="on_optimizer_step_begin")
|
||||
self.optimizer.step()
|
||||
self.optimizer.zero_grad()
|
||||
self._call_callbacks(event_name="on_optimizer_step_end")
|
||||
if self.clock.is_lr_scheduler_step:
|
||||
self._call_callbacks(event_name="on_lr_scheduler_step_begin")
|
||||
self.lr_scheduler.step()
|
||||
self._call_callbacks(event_name="on_lr_scheduler_step_end")
|
||||
if self.clock.is_evaluation_step:
|
||||
self.evaluate()
|
||||
if self.checkpointing_enabled and self.clock.is_checkpointing_step:
|
||||
self._call_callbacks(event_name="on_checkpoint_save")
|
||||
|
||||
def step(self, batch: Batch) -> None:
|
||||
"""Perform a single training step."""
|
||||
self._call_callbacks(event_name="on_compute_loss_begin")
|
||||
loss = self.compute_loss(batch=batch)
|
||||
self.loss = loss
|
||||
self._call_callbacks(event_name="on_compute_loss_end")
|
||||
self.backward()
|
||||
|
||||
def epoch(self) -> None:
|
||||
"""Perform a single epoch."""
|
||||
for batch in self.dataloader:
|
||||
self._call_callbacks(event_name="on_batch_begin")
|
||||
self.step(batch=batch)
|
||||
self._call_callbacks(event_name="on_batch_end")
|
||||
|
||||
@staticmethod
|
||||
def get_training_seed(instance: "Trainer[BaseConfig, Any]") -> int:
|
||||
return instance.config.training.seed
|
||||
|
||||
@scoped_seed(seed=get_training_seed)
|
||||
def train(self) -> None:
|
||||
"""Train the model."""
|
||||
self.set_models_to_train_mode()
|
||||
self._call_callbacks(event_name="on_train_begin")
|
||||
assert self.learnable_parameters, "There are no learnable parameters in the models."
|
||||
self.evaluate()
|
||||
while not self.clock.done:
|
||||
self._call_callbacks(event_name="on_epoch_begin")
|
||||
self.epoch()
|
||||
self._call_callbacks(event_name="on_epoch_end")
|
||||
self._call_callbacks(event_name="on_train_end")
|
||||
|
||||
@staticmethod
|
||||
def get_evaluation_seed(instance: "Trainer[BaseConfig, Any]") -> int:
|
||||
return instance.config.training.evaluation_seed
|
||||
|
||||
@no_grad()
|
||||
@scoped_seed(seed=get_evaluation_seed)
|
||||
def evaluate(self) -> None:
|
||||
"""Evaluate the model."""
|
||||
self.set_models_to_eval_mode()
|
||||
self._call_callbacks(event_name="on_evaluate_begin")
|
||||
self.compute_evaluation()
|
||||
self._call_callbacks(event_name="on_evaluate_end")
|
||||
self.set_models_to_train_mode()
|
||||
|
||||
def _call_callbacks(self, event_name: str) -> None:
|
||||
for callback in self.callbacks:
|
||||
getattr(callback, event_name)(self)
|
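In practice a training script subclasses `Trainer` with its batch and config types and fills in the three abstract hooks (`load_models`, `load_dataset`, `compute_loss`), as `LatentDiffusionTrainer` does above. A minimal sketch with placeholder model, dataset and loss (not a real recipe):

```python
from torch import Tensor
from torch.utils.data import Dataset

import refiners.fluxion.layers as fl
from refiners.training_utils.config import BaseConfig
from refiners.training_utils.trainer import Trainer


class MyTrainer(Trainer[BaseConfig, Tensor]):
    def load_models(self) -> dict[str, fl.Module]:
        # Keys must match the [models.*] entries of the TOML config.
        return {"regressor": fl.Linear(in_features=8, out_features=1)}

    def load_dataset(self) -> Dataset[Tensor]:
        ...  # placeholder: return any torch Dataset yielding the declared Batch type

    def compute_loss(self, batch: Tensor) -> Tensor:
        return self.models["regressor"](batch).pow(2).mean()


# trainer = MyTrainer(config=BaseConfig.load_from_toml(toml_path="configs/my_run.toml"))
# trainer.train()
```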
61
src/refiners/training_utils/wandb.py
Normal file
|
@ -0,0 +1,61 @@
|
from typing import Any
import wandb
from PIL import Image

__all__ = [
    "WandbLogger",
    "WandbLoggable",
]


number = float | int
WandbLoggable = number | Image.Image | list[number] | dict[str, list[number]]


def convert_to_wandb(value: WandbLoggable) -> Any:
    match value:
        case Image.Image():
            return convert_to_wandb_image(value=value)
        case list():
            return convert_to_wandb_histogram(value=value)
        case dict():
            return convert_to_wandb_table(value=value)
        case _:
            return value


def convert_to_wandb_image(value: Image.Image) -> wandb.Image:
    return wandb.Image(data_or_path=value)


def convert_to_wandb_histogram(value: list[number]) -> wandb.Histogram:
    return wandb.Histogram(sequence=value)


def convert_to_wandb_table(value: dict[str, list[number]]) -> wandb.Table:
    assert all(
        isinstance(v, list) and len(v) == len(next(iter(value.values()))) for v in value.values()
    ), "Expected a dictionary of lists of the same size"
    columns = list(value.keys())
    data_rows = list(zip(*value.values()))
    return wandb.Table(columns=columns, data=data_rows)


class WandbLogger:
    def __init__(self, init_config: dict[str, Any] = {}) -> None:
        self.wandb_run = wandb.init(**init_config)  # type: ignore

    def log(self, data: dict[str, WandbLoggable], step: int) -> None:
        converted_data = {key: convert_to_wandb(value=value) for key, value in data.items()}
        self.wandb_run.log(converted_data, step=step)  # type: ignore

    def update_summary(self, key: str, value: Any) -> None:
        self.wandb_run.summary[key] = value  # type: ignore

    @property
    def project_name(self) -> str:
        return self.wandb_run.project_name()  # type: ignore

    @property
    def run_name(self) -> str:
        return self.wandb_run.name or ""  # type: ignore
|
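`WandbLogger` converts values at logging time: PIL images become `wandb.Image`, flat lists of numbers become histograms, and dicts of equally sized lists become tables. A short sketch; the run settings are illustrative and `mode="offline"` avoids needing credentials:

```python
from refiners.training_utils.wandb import WandbLogger

logger = WandbLogger(init_config={"project": "refiners-demo", "mode": "offline"})
logger.log(data={"loss": 0.42, "grad_norms": [0.1, 0.3, 0.2]}, step=1)
logger.update_summary(key="best_loss", value=0.42)
```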
0
tests/__init__.py
Normal file
82
tests/adapters/test_adapter.py
Normal file
|
@ -0,0 +1,82 @@
|
|||
import pytest
|
||||
from refiners.adapters.adapter import Adapter
|
||||
from refiners.fluxion.layers import Chain, Linear
|
||||
|
||||
|
||||
class DummyLinearAdapter(Chain, Adapter[Linear]):
|
||||
def __init__(self, target: Linear):
|
||||
with self.setup_adapter(target):
|
||||
super().__init__(target)
|
||||
|
||||
|
||||
class DummyChainAdapter(Chain, Adapter[Chain]):
|
||||
def __init__(self, target: Chain):
|
||||
with self.setup_adapter(target):
|
||||
super().__init__(target)
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def chain() -> Chain:
|
||||
return Chain(Chain(Linear(2, 2)))
|
||||
|
||||
|
||||
def test_weighted_module_adapter_insertion(chain: Chain):
|
||||
parent = chain.Chain
|
||||
adaptee = parent.Linear
|
||||
|
||||
adapter = DummyLinearAdapter(adaptee)
|
||||
|
||||
adapter.inject(parent)
|
||||
assert adapter.parent == parent
|
||||
assert adapter in iter(parent)
|
||||
assert adaptee not in iter(parent)
|
||||
|
||||
adapter.eject()
|
||||
assert adapter.parent is None
|
||||
assert adapter not in iter(parent)
|
||||
assert adaptee in iter(parent)
|
||||
|
||||
|
||||
def test_chain_adapter_insertion(chain: Chain):
|
||||
parent = chain
|
||||
adaptee = parent.Chain
|
||||
|
||||
adapter = DummyChainAdapter(adaptee)
|
||||
assert adaptee.parent == parent
|
||||
|
||||
adapter.inject()
|
||||
assert adapter.parent == parent
|
||||
assert adaptee.parent == adapter
|
||||
assert adapter in iter(parent)
|
||||
assert adaptee not in iter(parent)
|
||||
|
||||
adapter.eject()
|
||||
assert adapter.parent is None
|
||||
assert adaptee.parent == parent
|
||||
assert adapter not in iter(parent)
|
||||
assert adaptee in iter(parent)
|
||||
|
||||
|
||||
def test_weighted_module_adapter_structural_copy(chain: Chain):
|
||||
parent = chain.Chain
|
||||
adaptee = parent.Linear
|
||||
|
||||
adapter = DummyLinearAdapter(adaptee)
|
||||
adapter.inject(parent)
|
||||
|
||||
clone = chain.structural_copy()
|
||||
cloned_adapter = clone.Chain.DummyLinearAdapter
|
||||
assert cloned_adapter.parent == clone.Chain
|
||||
assert cloned_adapter.target == adaptee
|
||||
|
||||
|
||||
def test_chain_adapter_structural_copy(chain: Chain):
|
||||
# Chain adapters cannot be copied by default.
|
||||
adapter = DummyChainAdapter(chain.Chain)
|
||||
adapter.inject()
|
||||
|
||||
with pytest.raises(RuntimeError):
|
||||
chain.structural_copy()
|
||||
|
||||
adapter.eject()
|
||||
chain.structural_copy()
|
29
tests/adapters/test_lora.py
Normal file
|
@ -0,0 +1,29 @@
|
|||
from refiners.adapters.lora import Lora, LoraAdapter
|
||||
from torch import randn, allclose
|
||||
import refiners.fluxion.layers as fl
|
||||
|
||||
|
||||
def test_lora() -> None:
|
||||
chain = fl.Chain(
|
||||
fl.Chain(
|
||||
fl.Linear(in_features=1, out_features=1),
|
||||
fl.Linear(in_features=1, out_features=1),
|
||||
),
|
||||
fl.Linear(in_features=1, out_features=2),
|
||||
)
|
||||
x = randn(1, 1)
|
||||
y = chain(x)
|
||||
|
||||
lora_adapter = LoraAdapter(chain.Chain.Linear_1)
|
||||
lora_adapter.inject(chain.Chain)
|
||||
|
||||
assert isinstance(lora_adapter[1], Lora)
|
||||
assert allclose(input=chain(x), other=y)
|
||||
assert lora_adapter.parent == chain.Chain
|
||||
|
||||
lora_adapter.eject()
|
||||
assert isinstance(chain.Chain[0], fl.Linear)
|
||||
assert len(chain) == 2
|
||||
|
||||
lora_adapter.inject(chain.Chain)
|
||||
assert isinstance(chain.Chain[0], LoraAdapter)
|
25
tests/adapters/test_range_adapter.py
Normal file
|
@ -0,0 +1,25 @@
|
|||
import torch
|
||||
from refiners.adapters.adapter import Adapter
|
||||
from refiners.adapters.range_adapter import RangeEncoder
|
||||
from refiners.fluxion.layers import Chain, Linear
|
||||
|
||||
|
||||
class DummyLinearAdapter(Chain, Adapter[Linear]):
|
||||
def __init__(self, target: Linear):
|
||||
with self.setup_adapter(target):
|
||||
super().__init__(target)
|
||||
|
||||
|
||||
def test_range_encoder_dtype_after_adaptation(test_device: torch.device): # FG-433
|
||||
dtype = torch.float64
|
||||
chain = Chain(RangeEncoder(320, 1280, device=test_device, dtype=dtype))
|
||||
|
||||
adaptee = chain.RangeEncoder.Linear_1
|
||||
adapter = DummyLinearAdapter(adaptee)
|
||||
adapter.inject(chain.RangeEncoder)
|
||||
|
||||
assert adapter.parent == chain.RangeEncoder
|
||||
|
||||
x = torch.tensor([42], device=test_device)
|
||||
y = chain(x)
|
||||
assert y.dtype == dtype
|
25
tests/conftest.py
Normal file
|
@ -0,0 +1,25 @@
|
|||
import os
|
||||
import torch
|
||||
from pathlib import Path
|
||||
from pytest import fixture
|
||||
|
||||
PARENT_PATH = Path(__file__).parent
|
||||
|
||||
|
||||
@fixture(scope="session")
|
||||
def test_device() -> torch.device:
|
||||
test_device = os.getenv("REFINERS_TEST_DEVICE")
|
||||
if not test_device:
|
||||
return torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
|
||||
return torch.device(test_device)
|
||||
|
||||
|
||||
@fixture(scope="session")
|
||||
def test_weights_path() -> Path:
|
||||
from_env = os.getenv("REFINERS_TEST_WEIGHTS_DIR")
|
||||
return Path(from_env) if from_env else PARENT_PATH / "weights"
|
||||
|
||||
|
||||
@fixture(scope="session")
|
||||
def test_e2e_path() -> Path:
|
||||
return PARENT_PATH / "e2e"
|
709
tests/e2e/test_diffusion.py
Normal file
|
@ -0,0 +1,709 @@
|
|||
import torch
|
||||
import pytest
|
||||
|
||||
from typing import Iterator
|
||||
|
||||
from warnings import warn
|
||||
from PIL import Image
|
||||
from pathlib import Path
|
||||
|
||||
from refiners.fluxion.utils import load_from_safetensors, image_to_tensor, manual_seed
|
||||
from refiners.foundationals.latent_diffusion import StableDiffusion_1, StableDiffusion_1_Inpainting
|
||||
from refiners.foundationals.latent_diffusion.unet import UNet
|
||||
from refiners.foundationals.latent_diffusion.controlnet import Controlnet
|
||||
from refiners.foundationals.latent_diffusion.lora import LoraWeights
|
||||
from refiners.foundationals.latent_diffusion.schedulers import DDIM
|
||||
from refiners.foundationals.latent_diffusion.self_attention_injection import SelfAttentionInjection
|
||||
|
||||
from tests.utils import ensure_similar_images
|
||||
|
||||
|
||||
@pytest.fixture(scope="module")
|
||||
def ref_path(test_e2e_path: Path) -> Path:
|
||||
return test_e2e_path / "test_diffusion_ref"
|
||||
|
||||
|
||||
@pytest.fixture(scope="module")
|
||||
def cutecat_init(ref_path: Path) -> Image.Image:
|
||||
return Image.open(ref_path / "cutecat_init.png").convert("RGB")
|
||||
|
||||
|
||||
@pytest.fixture(scope="module")
|
||||
def kitchen_dog(ref_path: Path) -> Image.Image:
|
||||
return Image.open(ref_path / "kitchen_dog.png").convert("RGB")
|
||||
|
||||
|
||||
@pytest.fixture(scope="module")
|
||||
def kitchen_dog_mask(ref_path: Path) -> Image.Image:
|
||||
return Image.open(ref_path / "kitchen_dog_mask.png").convert("RGB")
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def expected_image_std_random_init(ref_path: Path) -> Image.Image:
|
||||
return Image.open(ref_path / "expected_std_random_init.png").convert("RGB")
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def expected_image_std_init_image(ref_path: Path) -> Image.Image:
|
||||
return Image.open(ref_path / "expected_std_init_image.png").convert("RGB")
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def expected_image_std_inpainting(ref_path: Path) -> Image.Image:
|
||||
return Image.open(ref_path / "expected_std_inpainting.png").convert("RGB")
|
||||
|
||||
|
||||
@pytest.fixture(scope="module", params=["canny", "depth", "lineart", "normals", "sam"])
|
||||
def controlnet_data(
|
||||
ref_path: Path, test_weights_path: Path, request: pytest.FixtureRequest
|
||||
) -> Iterator[tuple[str, Image.Image, Image.Image, Path]]:
|
||||
cn_name: str = request.param
|
||||
condition_image = Image.open(ref_path / f"cutecat_guide_{cn_name}.png").convert("RGB")
|
||||
expected_image = Image.open(ref_path / f"expected_controlnet_{cn_name}.png").convert("RGB")
|
||||
weights_fn = {
|
||||
"depth": "lllyasviel_control_v11f1p_sd15_depth",
|
||||
"canny": "lllyasviel_control_v11p_sd15_canny",
|
||||
"lineart": "lllyasviel_control_v11p_sd15_lineart",
|
||||
"normals": "lllyasviel_control_v11p_sd15_normalbae",
|
||||
"sam": "mfidabel_controlnet-segment-anything",
|
||||
}
|
||||
|
||||
weights_path = test_weights_path / "controlnet" / f"{weights_fn[cn_name]}.safetensors"
|
||||
yield (cn_name, condition_image, expected_image, weights_path)
|
||||
|
||||
|
||||
@pytest.fixture(scope="module")
|
||||
def controlnet_data_canny(ref_path: Path, test_weights_path: Path) -> tuple[str, Image.Image, Image.Image, Path]:
|
||||
cn_name = "canny"
|
||||
condition_image = Image.open(ref_path / f"cutecat_guide_{cn_name}.png").convert("RGB")
|
||||
expected_image = Image.open(ref_path / f"expected_controlnet_{cn_name}.png").convert("RGB")
|
||||
weights_path = test_weights_path / "controlnet" / "lllyasviel_control_v11p_sd15_canny.safetensors"
|
||||
return cn_name, condition_image, expected_image, weights_path
|
||||
|
||||
|
||||
@pytest.fixture(scope="module")
|
||||
def lora_data_pokemon(ref_path: Path, test_weights_path: Path) -> tuple[Image.Image, Path]:
|
||||
expected_image = Image.open(ref_path / "expected_lora_pokemon.png").convert("RGB")
|
||||
weights_path = test_weights_path / "loras" / "pcuenq_pokemon_lora.safetensors"
|
||||
return expected_image, weights_path
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def scene_image_inpainting_refonly(ref_path: Path) -> Image.Image:
|
||||
return Image.open(ref_path / "inpainting-scene.png").convert("RGB")
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def mask_image_inpainting_refonly(ref_path: Path) -> Image.Image:
|
||||
return Image.open(ref_path / "inpainting-mask.png").convert("RGB")
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def target_image_inpainting_refonly(ref_path: Path) -> Image.Image:
|
||||
return Image.open(ref_path / "inpainting-target.png").convert("RGB")
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def expected_image_inpainting_refonly(ref_path: Path) -> Image.Image:
|
||||
return Image.open(ref_path / "expected_inpainting_refonly.png").convert("RGB")
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def expected_image_refonly(ref_path: Path) -> Image.Image:
|
||||
return Image.open(ref_path / "expected_refonly.png").convert("RGB")
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def condition_image_refonly(ref_path: Path) -> Image.Image:
|
||||
return Image.open(ref_path / "cyberpunk_guide.png").convert("RGB")
|
||||
|
||||
|
||||
@pytest.fixture(scope="module")
|
||||
def text_encoder_weights(test_weights_path: Path) -> Path:
|
||||
text_encoder_weights = test_weights_path / "CLIPTextEncoderL.safetensors"
|
||||
if not text_encoder_weights.is_file():
|
||||
warn(f"could not find weights at {text_encoder_weights}, skipping")
|
||||
pytest.skip(allow_module_level=True)
|
||||
return text_encoder_weights
|
||||
|
||||
|
||||
@pytest.fixture(scope="module")
|
||||
def lda_weights(test_weights_path: Path) -> Path:
|
||||
lda_weights = test_weights_path / "lda.safetensors"
|
||||
if not lda_weights.is_file():
|
||||
warn(f"could not find weights at {lda_weights}, skipping")
|
||||
pytest.skip(allow_module_level=True)
|
||||
return lda_weights
|
||||
|
||||
|
||||
@pytest.fixture(scope="module")
|
||||
def unet_weights_std(test_weights_path: Path) -> Path:
|
||||
unet_weights_std = test_weights_path / "unet.safetensors"
|
||||
if not unet_weights_std.is_file():
|
||||
warn(f"could not find weights at {unet_weights_std}, skipping")
|
||||
pytest.skip(allow_module_level=True)
|
||||
return unet_weights_std
|
||||
|
||||
|
||||
@pytest.fixture(scope="module")
|
||||
def unet_weights_inpainting(test_weights_path: Path) -> Path:
|
||||
unet_weights_inpainting = test_weights_path / "inpainting" / "unet.safetensors"
|
||||
if not unet_weights_inpainting.is_file():
|
||||
warn(f"could not find weights at {unet_weights_inpainting}, skipping")
|
||||
pytest.skip(allow_module_level=True)
|
||||
return unet_weights_inpainting
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def sd15_std(
|
||||
text_encoder_weights: Path, lda_weights: Path, unet_weights_std: Path, test_device: torch.device
|
||||
) -> StableDiffusion_1:
|
||||
if test_device.type == "cpu":
|
||||
warn("not running on CPU, skipping")
|
||||
pytest.skip()
|
||||
|
||||
sd15 = StableDiffusion_1(device=test_device)
|
||||
|
||||
sd15.clip_text_encoder.load_state_dict(load_from_safetensors(text_encoder_weights))
|
||||
sd15.lda.load_state_dict(load_from_safetensors(lda_weights))
|
||||
sd15.unet.load_state_dict(load_from_safetensors(unet_weights_std))
|
||||
|
||||
return sd15
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def sd15_std_float16(
|
||||
text_encoder_weights: Path, lda_weights: Path, unet_weights_std: Path, test_device: torch.device
|
||||
) -> StableDiffusion_1:
|
||||
if test_device.type == "cpu":
|
||||
warn("not running on CPU, skipping")
|
||||
pytest.skip()
|
||||
|
||||
sd15 = StableDiffusion_1(device=test_device, dtype=torch.float16)
|
||||
|
||||
sd15.clip_text_encoder.load_state_dict(load_from_safetensors(text_encoder_weights))
|
||||
sd15.lda.load_state_dict(load_from_safetensors(lda_weights))
|
||||
sd15.unet.load_state_dict(load_from_safetensors(unet_weights_std))
|
||||
|
||||
return sd15
|
||||
|
||||
|
||||
@pytest.fixture
def sd15_inpainting(
    text_encoder_weights: Path, lda_weights: Path, unet_weights_inpainting: Path, test_device: torch.device
) -> StableDiffusion_1_Inpainting:
    if test_device.type == "cpu":
        warn("not running on CPU, skipping")
        pytest.skip()

    unet = UNet(in_channels=9, clip_embedding_dim=768)
    sd15 = StableDiffusion_1_Inpainting(unet=unet, device=test_device)

    sd15.clip_text_encoder.load_state_dict(load_from_safetensors(text_encoder_weights))
    sd15.lda.load_state_dict(load_from_safetensors(lda_weights))
    sd15.unet.load_state_dict(load_from_safetensors(unet_weights_inpainting))

    return sd15


@pytest.fixture
def sd15_inpainting_float16(
    text_encoder_weights: Path, lda_weights: Path, unet_weights_inpainting: Path, test_device: torch.device
) -> StableDiffusion_1_Inpainting:
    if test_device.type == "cpu":
        warn("not running on CPU, skipping")
        pytest.skip()

    unet = UNet(in_channels=9, clip_embedding_dim=768)
    sd15 = StableDiffusion_1_Inpainting(unet=unet, device=test_device, dtype=torch.float16)

    sd15.clip_text_encoder.load_state_dict(load_from_safetensors(text_encoder_weights))
    sd15.lda.load_state_dict(load_from_safetensors(lda_weights))
    sd15.unet.load_state_dict(load_from_safetensors(unet_weights_inpainting))

    return sd15

@pytest.fixture
def sd15_ddim(
    text_encoder_weights: Path, lda_weights: Path, unet_weights_std: Path, test_device: torch.device
) -> StableDiffusion_1:
    if test_device.type == "cpu":
        warn("not running on CPU, skipping")
        pytest.skip()

    ddim_scheduler = DDIM(num_inference_steps=20)
    sd15 = StableDiffusion_1(scheduler=ddim_scheduler, device=test_device)

    sd15.clip_text_encoder.load_state_dict(load_from_safetensors(text_encoder_weights))
    sd15.lda.load_state_dict(load_from_safetensors(lda_weights))
    sd15.unet.load_state_dict(load_from_safetensors(unet_weights_std))

    return sd15

@torch.no_grad()
def test_diffusion_std_random_init(
    sd15_std: StableDiffusion_1, expected_image_std_random_init: Image.Image, test_device: torch.device
):
    sd15 = sd15_std
    n_steps = 30

    prompt = "a cute cat, detailed high-quality professional image"
    negative_prompt = "lowres, bad anatomy, bad hands, cropped, worst quality"

    with torch.no_grad():
        clip_text_embedding = sd15.compute_text_embedding(prompt)
        negative_clip_text_embedding = sd15.compute_text_embedding(negative_prompt)

    sd15.set_num_inference_steps(n_steps)

    manual_seed(2)
    x = torch.randn(1, 4, 64, 64, device=test_device)

    with torch.no_grad():
        for step in sd15.steps:
            x = sd15(
                x,
                step=step,
                clip_text_embedding=clip_text_embedding,
                negative_clip_text_embedding=negative_clip_text_embedding,
                condition_scale=7.5,
            )
        predicted_image = sd15.lda.decode_latents(x)

    ensure_similar_images(predicted_image, expected_image_std_random_init)

@torch.no_grad()
def test_diffusion_std_random_init_float16(
    sd15_std_float16: StableDiffusion_1, expected_image_std_random_init: Image.Image, test_device: torch.device
):
    sd15 = sd15_std_float16
    n_steps = 30

    prompt = "a cute cat, detailed high-quality professional image"
    negative_prompt = "lowres, bad anatomy, bad hands, cropped, worst quality"

    with torch.no_grad():
        clip_text_embedding = sd15.compute_text_embedding(prompt)
        negative_clip_text_embedding = sd15.compute_text_embedding(negative_prompt)

    assert clip_text_embedding.dtype == torch.float16
    assert negative_clip_text_embedding.dtype == torch.float16

    sd15.set_num_inference_steps(n_steps)

    manual_seed(2)
    x = torch.randn(1, 4, 64, 64, device=test_device, dtype=torch.float16)

    with torch.no_grad():
        for step in sd15.steps:
            x = sd15(
                x,
                step=step,
                clip_text_embedding=clip_text_embedding,
                negative_clip_text_embedding=negative_clip_text_embedding,
                condition_scale=7.5,
            )
        predicted_image = sd15.lda.decode_latents(x)

    ensure_similar_images(predicted_image, expected_image_std_random_init, min_psnr=35, min_ssim=0.98)

@torch.no_grad()
def test_diffusion_std_init_image(
    sd15_std: StableDiffusion_1,
    cutecat_init: Image.Image,
    expected_image_std_init_image: Image.Image,
):
    sd15 = sd15_std
    n_steps = 35
    first_step = 5

    prompt = "a cute cat, detailed high-quality professional image"
    negative_prompt = "lowres, bad anatomy, bad hands, cropped, worst quality"

    with torch.no_grad():
        clip_text_embedding = sd15.compute_text_embedding(prompt)
        negative_clip_text_embedding = sd15.compute_text_embedding(negative_prompt)

    sd15.set_num_inference_steps(n_steps)

    manual_seed(2)
    x = sd15.init_latents((512, 512), cutecat_init, first_step=first_step)

    with torch.no_grad():
        for step in sd15.steps[first_step:]:
            x = sd15(
                x,
                step=step,
                clip_text_embedding=clip_text_embedding,
                negative_clip_text_embedding=negative_clip_text_embedding,
                condition_scale=7.5,
            )
        predicted_image = sd15.lda.decode_latents(x)

    ensure_similar_images(predicted_image, expected_image_std_init_image)

@torch.no_grad()
def test_diffusion_inpainting(
    sd15_inpainting: StableDiffusion_1_Inpainting,
    kitchen_dog: Image.Image,
    kitchen_dog_mask: Image.Image,
    expected_image_std_inpainting: Image.Image,
    test_device: torch.device,
):
    sd15 = sd15_inpainting
    n_steps = 30

    prompt = "a large white cat, detailed high-quality professional image, sitting on a chair, in a kitchen"
    negative_prompt = "lowres, bad anatomy, bad hands, cropped, worst quality"

    with torch.no_grad():
        clip_text_embedding = sd15.compute_text_embedding(prompt)
        negative_clip_text_embedding = sd15.compute_text_embedding(negative_prompt)

    sd15.set_num_inference_steps(n_steps)
    sd15.set_inpainting_conditions(kitchen_dog, kitchen_dog_mask)

    manual_seed(2)
    x = torch.randn(1, 4, 64, 64, device=test_device)

    with torch.no_grad():
        for step in sd15.steps:
            x = sd15(
                x,
                step=step,
                clip_text_embedding=clip_text_embedding,
                negative_clip_text_embedding=negative_clip_text_embedding,
                condition_scale=7.5,
            )
        predicted_image = sd15.lda.decode_latents(x)

    # Relaxed thresholds: even in float32, inpainting outputs differ noticeably vs. our own reference.
    ensure_similar_images(predicted_image, expected_image_std_inpainting, min_psnr=25, min_ssim=0.95)

@torch.no_grad()
def test_diffusion_inpainting_float16(
    sd15_inpainting_float16: StableDiffusion_1_Inpainting,
    kitchen_dog: Image.Image,
    kitchen_dog_mask: Image.Image,
    expected_image_std_inpainting: Image.Image,
    test_device: torch.device,
):
    sd15 = sd15_inpainting_float16
    n_steps = 30

    prompt = "a large white cat, detailed high-quality professional image, sitting on a chair, in a kitchen"
    negative_prompt = "lowres, bad anatomy, bad hands, cropped, worst quality"

    with torch.no_grad():
        clip_text_embedding = sd15.compute_text_embedding(prompt)
        negative_clip_text_embedding = sd15.compute_text_embedding(negative_prompt)

    assert clip_text_embedding.dtype == torch.float16
    assert negative_clip_text_embedding.dtype == torch.float16

    sd15.set_num_inference_steps(n_steps)
    sd15.set_inpainting_conditions(kitchen_dog, kitchen_dog_mask)

    manual_seed(2)
    x = torch.randn(1, 4, 64, 64, device=test_device, dtype=torch.float16)

    with torch.no_grad():
        for step in sd15.steps:
            x = sd15(
                x,
                step=step,
                clip_text_embedding=clip_text_embedding,
                negative_clip_text_embedding=negative_clip_text_embedding,
                condition_scale=7.5,
            )
        predicted_image = sd15.lda.decode_latents(x)

    # Even more relaxed thresholds: float16 diverges further than float32.
    ensure_similar_images(predicted_image, expected_image_std_inpainting, min_psnr=20, min_ssim=0.92)

@torch.no_grad()
def test_diffusion_controlnet(
    sd15_std: StableDiffusion_1,
    controlnet_data: tuple[str, Image.Image, Image.Image, Path],
    test_device: torch.device,
):
    sd15 = sd15_std
    n_steps = 30

    cn_name, condition_image, expected_image, cn_weights_path = controlnet_data

    if not cn_weights_path.is_file():
        warn(f"could not find weights at {cn_weights_path}, skipping")
        pytest.skip(allow_module_level=True)

    prompt = "a cute cat, detailed high-quality professional image"
    negative_prompt = "lowres, bad anatomy, bad hands, cropped, worst quality"

    with torch.no_grad():
        clip_text_embedding = sd15.compute_text_embedding(prompt)
        negative_clip_text_embedding = sd15.compute_text_embedding(negative_prompt)

    sd15.set_num_inference_steps(n_steps)

    controlnet_state_dict = load_from_safetensors(cn_weights_path)
    controlnet = Controlnet(name=cn_name, device=test_device)
    controlnet.load_state_dict(controlnet_state_dict)
    controlnet.set_scale(0.5)
    sd15.unet.insert(0, controlnet)

    cn_condition = image_to_tensor(condition_image.convert("RGB"), device=test_device)

    manual_seed(2)
    x = torch.randn(1, 4, 64, 64, device=test_device)

    with torch.no_grad():
        for step in sd15.steps:
            controlnet.set_controlnet_condition(cn_condition)
            x = sd15(
                x,
                step=step,
                clip_text_embedding=clip_text_embedding,
                negative_clip_text_embedding=negative_clip_text_embedding,
                condition_scale=7.5,
            )
        predicted_image = sd15.lda.decode_latents(x)

    ensure_similar_images(predicted_image, expected_image, min_psnr=35, min_ssim=0.98)

@torch.no_grad()
def test_diffusion_controlnet_structural_copy(
    sd15_std: StableDiffusion_1,
    controlnet_data_canny: tuple[str, Image.Image, Image.Image, Path],
    test_device: torch.device,
):
    sd15_base = sd15_std
    sd15 = sd15_base.structural_copy()
    n_steps = 30

    cn_name, condition_image, expected_image, cn_weights_path = controlnet_data_canny

    if not cn_weights_path.is_file():
        warn(f"could not find weights at {cn_weights_path}, skipping")
        pytest.skip(allow_module_level=True)

    prompt = "a cute cat, detailed high-quality professional image"
    negative_prompt = "lowres, bad anatomy, bad hands, cropped, worst quality"

    with torch.no_grad():
        clip_text_embedding = sd15.compute_text_embedding(prompt)
        negative_clip_text_embedding = sd15.compute_text_embedding(negative_prompt)

    sd15.set_num_inference_steps(n_steps)

    controlnet_state_dict = load_from_safetensors(cn_weights_path)
    controlnet = Controlnet(name=cn_name, device=test_device)
    controlnet.load_state_dict(controlnet_state_dict)
    controlnet.set_scale(0.5)
    sd15.unet.insert(0, controlnet)

    cn_condition = image_to_tensor(condition_image.convert("RGB"), device=test_device)

    manual_seed(2)
    x = torch.randn(1, 4, 64, 64, device=test_device)

    with torch.no_grad():
        for step in sd15.steps:
            controlnet.set_controlnet_condition(cn_condition)
            x = sd15(
                x,
                step=step,
                clip_text_embedding=clip_text_embedding,
                negative_clip_text_embedding=negative_clip_text_embedding,
                condition_scale=7.5,
            )
        predicted_image = sd15.lda.decode_latents(x)

    ensure_similar_images(predicted_image, expected_image, min_psnr=35, min_ssim=0.98)

@torch.no_grad()
def test_diffusion_controlnet_float16(
    sd15_std_float16: StableDiffusion_1,
    controlnet_data_canny: tuple[str, Image.Image, Image.Image, Path],
    test_device: torch.device,
):
    sd15 = sd15_std_float16
    n_steps = 30

    cn_name, condition_image, expected_image, cn_weights_path = controlnet_data_canny

    if not cn_weights_path.is_file():
        warn(f"could not find weights at {cn_weights_path}, skipping")
        pytest.skip(allow_module_level=True)

    prompt = "a cute cat, detailed high-quality professional image"
    negative_prompt = "lowres, bad anatomy, bad hands, cropped, worst quality"

    with torch.no_grad():
        clip_text_embedding = sd15.compute_text_embedding(prompt)
        negative_clip_text_embedding = sd15.compute_text_embedding(negative_prompt)

    sd15.set_num_inference_steps(n_steps)

    controlnet_state_dict = load_from_safetensors(cn_weights_path)
    controlnet = Controlnet(name=cn_name, device=test_device, dtype=torch.float16)
    controlnet.load_state_dict(controlnet_state_dict)
    controlnet.set_scale(0.5)
    sd15.unet.insert(0, controlnet)

    cn_condition = image_to_tensor(condition_image.convert("RGB"), device=test_device, dtype=torch.float16)

    manual_seed(2)
    x = torch.randn(1, 4, 64, 64, device=test_device, dtype=torch.float16)

    with torch.no_grad():
        for step in sd15.steps:
            controlnet.set_controlnet_condition(cn_condition)
            x = sd15(
                x,
                step=step,
                clip_text_embedding=clip_text_embedding,
                negative_clip_text_embedding=negative_clip_text_embedding,
                condition_scale=7.5,
            )
        predicted_image = sd15.lda.decode_latents(x)

    ensure_similar_images(predicted_image, expected_image, min_psnr=35, min_ssim=0.98)

@torch.no_grad()
def test_diffusion_lora(
    sd15_std: StableDiffusion_1,
    lora_data_pokemon: tuple[Image.Image, Path],
    test_device: torch.device,
):
    sd15 = sd15_std
    n_steps = 30

    expected_image, lora_weights_path = lora_data_pokemon

    if not lora_weights_path.is_file():
        warn(f"could not find weights at {lora_weights_path}, skipping")
        pytest.skip(allow_module_level=True)

    prompt = "a cute cat"

    with torch.no_grad():
        clip_text_embedding = sd15.compute_text_embedding(prompt)

    sd15.set_num_inference_steps(n_steps)

    lora_weights = LoraWeights(lora_weights_path, device=test_device)
    lora_weights.patch(sd15, scale=1.0)

    manual_seed(2)
    x = torch.randn(1, 4, 64, 64, device=test_device)

    with torch.no_grad():
        for step in sd15.steps:
            x = sd15(
                x,
                step=step,
                clip_text_embedding=clip_text_embedding,
                condition_scale=7.5,
            )
        predicted_image = sd15.lda.decode_latents(x)

    ensure_similar_images(predicted_image, expected_image, min_psnr=35, min_ssim=0.98)

@torch.no_grad()
def test_diffusion_refonly(
    sd15_ddim: StableDiffusion_1,
    condition_image_refonly: Image.Image,
    expected_image_refonly: Image.Image,
    test_device: torch.device,
):
    sd15 = sd15_ddim
    prompt = "Chicken"

    with torch.no_grad():
        clip_text_embedding = sd15.compute_text_embedding(prompt)

    sai = SelfAttentionInjection(sd15.unet)
    sai.inject()

    guide = sd15.lda.encode_image(condition_image_refonly)
    guide = torch.cat((guide, guide))

    manual_seed(2)
    x = torch.randn(1, 4, 64, 64, device=test_device)

    with torch.no_grad():
        for step in sd15.steps:
            noise = torch.randn(2, 4, 64, 64, device=test_device)
            noised_guide = sd15.scheduler.add_noise(guide, noise, step)
            sai.set_controlnet_condition(noised_guide)
            x = sd15(
                x,
                step=step,
                clip_text_embedding=clip_text_embedding,
                condition_scale=7.5,
            )
            torch.randn(2, 4, 64, 64, device=test_device)  # for SD Web UI reproducibility only
        predicted_image = sd15.lda.decode_latents(x)

    ensure_similar_images(predicted_image, expected_image_refonly, min_psnr=35, min_ssim=0.99)

@torch.no_grad()
def test_diffusion_inpainting_refonly(
    sd15_inpainting: StableDiffusion_1_Inpainting,
    scene_image_inpainting_refonly: Image.Image,
    target_image_inpainting_refonly: Image.Image,
    mask_image_inpainting_refonly: Image.Image,
    expected_image_inpainting_refonly: Image.Image,
    test_device: torch.device,
):
    sd15 = sd15_inpainting
    n_steps = 30
    prompt = ""  # unconditional

    with torch.no_grad():
        clip_text_embedding = sd15.compute_text_embedding(prompt)

    sai = SelfAttentionInjection(sd15.unet)
    sai.inject()

    sd15.set_num_inference_steps(n_steps)
    sd15.set_inpainting_conditions(target_image_inpainting_refonly, mask_image_inpainting_refonly)

    refonly_guide = sd15.lda.encode_image(scene_image_inpainting_refonly)
    refonly_guide = torch.cat((refonly_guide, refonly_guide))

    manual_seed(2)
    x = torch.randn(1, 4, 64, 64, device=test_device)

    with torch.no_grad():
        for step in sd15.steps:
            refonly_noise = torch.randn_like(refonly_guide)
            refonly_noised_guide = sd15.scheduler.add_noise(refonly_guide, refonly_noise, step)
            # See https://github.com/Mikubill/sd-webui-controlnet/pull/1275 ("1.1.170 reference-only begin to support
            # inpaint variation models")
            refonly_noised_guide = torch.cat(
                [refonly_noised_guide, torch.zeros_like(refonly_noised_guide)[:, 0:1, :, :], refonly_guide], dim=1
            )

            sai.set_controlnet_condition(refonly_noised_guide)
            x = sd15(
                x,
                step=step,
                clip_text_embedding=clip_text_embedding,
                condition_scale=7.5,
            )
        predicted_image = sd15.lda.decode_latents(x)

    ensure_similar_images(predicted_image, expected_image_inpainting_refonly, min_psnr=35, min_ssim=0.99)
82
tests/e2e/test_diffusion_ref/README.md
Normal file
|
@ -0,0 +1,82 @@
|
|||
# Note about this data

## Expected outputs

`expected_*.png` files are the output of the same diffusion run with a different codebase, usually diffusers with the same settings as us (`DPMSolverMultistepScheduler`, VAE [patched to remove randomness](#vae-without-randomness), same seed...).

For instance here is how we generate `expected_std_random_init.png`:

```py
import torch

from diffusers import DPMSolverMultistepScheduler
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float32,
).to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

prompt = "a cute cat, detailed high-quality professional image"
negative_prompt = "lowres, bad anatomy, bad hands, cropped, worst quality"

torch.manual_seed(2)
output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=30,
    guidance_scale=7.5,
)

output.images[0].save("expected_std_random_init.png")
```

Special cases:

- `expected_refonly.png` has been generated [with Stable Diffusion web UI](https://github.com/AUTOMATIC1111/stable-diffusion-webui).
- `expected_inpainting_refonly.png` has been generated with refiners itself (and manually inspected to check that it looks reasonable).

## Other images

- `cutecat_init.png` is generated with the same Diffusers script and prompt as above, but with seed 1234 (see the sketch after this list).

- `kitchen_dog.png` is generated with the same Diffusers script and negative prompt, with seed 12 and the positive prompt "a small brown dog, detailed high-quality professional image, sitting on a chair, in a kitchen" (see the sketch after this list).

- `kitchen_mask.png` is made manually.

- Controlnet guides have been manually generated using open source software and models, namely:
  - Canny: opencv-python (see the sketch after this list)
  - Depth: https://github.com/isl-org/ZoeDepth
  - Lineart: https://github.com/lllyasviel/ControlNet-v1-1-nightly/tree/main/annotator/lineart
  - Normals: https://github.com/baegwangbin/surface_normal_uncertainty/tree/fe2b9f1
  - SAM: https://huggingface.co/spaces/mfidabel/controlnet-segment-anything

- `cyberpunk_guide.png` [comes from Lexica](https://lexica.art/prompt/5ba40855-0d0c-4322-8722-51115985f573).

- `inpainting-mask.png`, `inpainting-scene.png` and `inpainting-target.png` have been generated as follows:
  - `inpainting-mask.png`: negated version of a mask computed with [SAM](https://github.com/facebookresearch/segment-anything) automatic mask generation using the `vit_h` checkpoint (see the sketch after this list)
  - `inpainting-scene.png`: cropped-to-square-and-resized version of https://unsplash.com/photos/RCz6eSVPGYU by @jannerboy62
  - `inpainting-target.png`: computed with `convert <(convert -size 512x512 xc:white png:-) kitchen_dog.png <(convert inpainting-mask.png -negate png:-) -compose Over -composite inpainting-target.png`
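
The init images can be reproduced with small variations of the Diffusers script from the "Expected outputs" section. This is a minimal sketch, assuming `pipe` is the `StableDiffusionPipeline` built above (not necessarily the exact command we ran):

```py
# Assumes the pipeline setup from the "Expected outputs" section has been run (`pipe`, `torch` imported).
negative_prompt = "lowres, bad anatomy, bad hands, cropped, worst quality"

# cutecat_init.png: same prompt as the reference run, but seed 1234.
torch.manual_seed(1234)
cutecat = pipe(
    prompt="a cute cat, detailed high-quality professional image",
    negative_prompt=negative_prompt,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
cutecat.save("cutecat_init.png")

# kitchen_dog.png: different positive prompt, seed 12.
torch.manual_seed(12)
kitchen_dog = pipe(
    prompt="a small brown dog, detailed high-quality professional image, sitting on a chair, in a kitchen",
    negative_prompt=negative_prompt,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
kitchen_dog.save("kitchen_dog.png")
```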
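
For the Canny guide, opencv-python's `cv2.Canny` is enough. A minimal sketch; the thresholds below are illustrative assumptions, not necessarily the ones actually used:

```py
import cv2

# Read the init image (BGR), convert to grayscale and extract Canny edges.
image = cv2.imread("cutecat_init.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)  # illustrative thresholds
cv2.imwrite("cutecat_guide_canny.png", edges)
```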
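
And a minimal sketch of the `inpainting-mask.png` procedure with the segment-anything library and the `vit_h` checkpoint; which generated mask gets picked and the exact post-processing are assumptions here:

```py
import numpy as np
from PIL import Image
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# Load SAM with the vit_h checkpoint and run automatic mask generation on the scene.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
generator = SamAutomaticMaskGenerator(sam)
scene = np.array(Image.open("inpainting-scene.png").convert("RGB"))
masks = generator.generate(scene)  # list of dicts with a boolean "segmentation" array

segmentation = masks[0]["segmentation"]  # assumption: pick the relevant mask

# Negate the mask and save it as a black-and-white image.
negated = np.where(segmentation, 0, 255).astype(np.uint8)
Image.fromarray(negated, mode="L").save("inpainting-mask.png")
```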
## VAE without randomness

```diff
--- a/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py
+++ b/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py
@@ -524,13 +524,8 @@ class StableDiffusionImg2ImgPipeline(DiffusionPipeline):
                 f" size of {batch_size}. Make sure the batch size matches the length of the generators."
             )

-        if isinstance(generator, list):
-            init_latents = [
-                self.vae.encode(image[i : i + 1]).latent_dist.sample(generator[i]) for i in range(batch_size)
-            ]
-            init_latents = torch.cat(init_latents, dim=0)
-        else:
-            init_latents = self.vae.encode(image).latent_dist.sample(generator)
+        init_latents = [self.vae.encode(image[i : i + 1]).latent_dist.mean for i in range(batch_size)]
+        init_latents = torch.cat(init_latents, dim=0)

         init_latents = self.vae.config.scaling_factor * init_latents
```
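
The same idea outside the pipeline, if you just want a deterministic latent for an image with diffusers (a sketch of the concept, not the patch itself; the input tensor here is a placeholder):

```py
import torch
from diffusers.models import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

# Placeholder input for the sketch: a (1, 3, 512, 512) float tensor scaled to [-1, 1].
image_tensor = torch.zeros(1, 3, 512, 512)

# Take the mean of the latent distribution instead of sampling from it.
with torch.no_grad():
    latents = vae.encode(image_tensor).latent_dist.mean * vae.config.scaling_factor
```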
BIN
tests/e2e/test_diffusion_ref/cutecat_guide_canny.png
Normal file
After Width: | Height: | Size: 18 KiB |
BIN
tests/e2e/test_diffusion_ref/cutecat_guide_depth.png
Normal file
After Width: | Height: | Size: 49 KiB |
BIN
tests/e2e/test_diffusion_ref/cutecat_guide_lineart.png
Normal file
After Width: | Height: | Size: 124 KiB |
BIN
tests/e2e/test_diffusion_ref/cutecat_guide_normals.png
Normal file
After Width: | Height: | Size: 142 KiB |
BIN
tests/e2e/test_diffusion_ref/cutecat_guide_sam.png
Normal file
After Width: | Height: | Size: 7.2 KiB |
BIN
tests/e2e/test_diffusion_ref/cutecat_init.png
Normal file
After Width: | Height: | Size: 387 KiB |
BIN
tests/e2e/test_diffusion_ref/cyberpunk_guide.png
Normal file
After Width: | Height: | Size: 563 KiB |
BIN
tests/e2e/test_diffusion_ref/expected_controlnet_canny.png
Normal file
After Width: | Height: | Size: 416 KiB |
BIN
tests/e2e/test_diffusion_ref/expected_controlnet_depth.png
Normal file
After Width: | Height: | Size: 409 KiB |
BIN
tests/e2e/test_diffusion_ref/expected_controlnet_lineart.png
Normal file
After Width: | Height: | Size: 468 KiB |
BIN
tests/e2e/test_diffusion_ref/expected_controlnet_normals.png
Normal file
After Width: | Height: | Size: 447 KiB |
BIN
tests/e2e/test_diffusion_ref/expected_controlnet_sam.png
Normal file
After Width: | Height: | Size: 386 KiB |
BIN
tests/e2e/test_diffusion_ref/expected_inpainting_refonly.png
Normal file
After Width: | Height: | Size: 461 KiB |
BIN
tests/e2e/test_diffusion_ref/expected_lora_pokemon.png
Normal file
After Width: | Height: | Size: 336 KiB |
BIN
tests/e2e/test_diffusion_ref/expected_refonly.png
Normal file
After Width: | Height: | Size: 700 KiB |
BIN
tests/e2e/test_diffusion_ref/expected_std_init_image.png
Normal file
After Width: | Height: | Size: 393 KiB |
BIN
tests/e2e/test_diffusion_ref/expected_std_inpainting.png
Normal file
After Width: | Height: | Size: 316 KiB |
BIN
tests/e2e/test_diffusion_ref/expected_std_random_init.png
Normal file
After Width: | Height: | Size: 491 KiB |
BIN
tests/e2e/test_diffusion_ref/inpainting-mask.png
Normal file
After Width: | Height: | Size: 1.6 KiB |
BIN
tests/e2e/test_diffusion_ref/inpainting-scene.png
Normal file
After Width: | Height: | Size: 476 KiB |
BIN
tests/e2e/test_diffusion_ref/inpainting-target.png
Normal file
After Width: | Height: | Size: 225 KiB |
BIN
tests/e2e/test_diffusion_ref/kitchen_dog.png
Normal file
After Width: | Height: | Size: 379 KiB |
BIN
tests/e2e/test_diffusion_ref/kitchen_dog_mask.png
Normal file
After Width: | Height: | Size: 10 KiB |