CLIP
CLIPImageEncoder
CLIPImageEncoder(
image_size: int = 224,
embedding_dim: int = 768,
output_dim: int = 512,
patch_size: int = 32,
num_layers: int = 12,
num_attention_heads: int = 12,
feedforward_dim: int = 3072,
layer_norm_eps: float = 1e-05,
device: device | str | None = None,
dtype: dtype | None = None,
)
Bases: Chain
Contrastive Language-Image Pretraining (CLIP) image encoder.
See [arXiv:2103.00020] Learning Transferable Visual Models From Natural Language Supervision for more details.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
image_size | int | The size of the input image. | 224 |
embedding_dim | int | The dimension of the embedding. | 768 |
output_dim | int | The dimension of the output. | 512 |
patch_size | int | The size of the patches. | 32 |
num_layers | int | The number of layers. | 12 |
num_attention_heads | int | The number of attention heads. | 12 |
feedforward_dim | int | The dimension of the feedforward layer. | 3072 |
layer_norm_eps | float | The epsilon value for normalization. | 1e-05 |
device | device \| str \| None | The PyTorch device to use. | None |
dtype | dtype \| None | The PyTorch data type to use. | None |
Source code in src/refiners/foundationals/clip/image_encoder.py
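A minimal usage sketch with randomly initialized weights (in practice you would load converted pretrained weights); the shapes in the comments are assumptions based on the defaults above:

```python
import torch

from refiners.foundationals.clip.image_encoder import CLIPImageEncoder

# Default configuration, randomly initialized weights.
encoder = CLIPImageEncoder(device="cpu", dtype=torch.float32)

# The encoder expects a batch of RGB images of size image_size x image_size.
image = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    embedding = encoder(image)

# The last dimension of the output should equal output_dim (512 by default).
print(embedding.shape)
```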
CLIPImageEncoderG
Bases: CLIPImageEncoder
CLIP giant image encoder.
See [arXiv:2103.00020] Learning Transferable Visual Models From Natural Language Supervision for more details.
Attributes:
Name | Type | Value |
---|---|---|
embedding_dim | int | 1664 |
output_dim | int | 1280 |
patch_size | int | 14 |
num_layers | int | 48 |
num_attention_heads | int | 16 |
feedforward_dim | int | 8192 |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
device | device \| str \| None | The PyTorch device to use. | None |
dtype | dtype \| None | The PyTorch data type to use. | None |
Source code in src/refiners/foundationals/clip/image_encoder.py
CLIPImageEncoderH
Bases: CLIPImageEncoder
CLIP huge image encoder.
See [arXiv:2103.00020] Learning Transferable Visual Models From Natural Language Supervision for more details.
Attributes:
Name | Type | Value |
---|---|---|
embedding_dim | int | 1280 |
output_dim | int | 1024 |
patch_size | int | 14 |
num_layers | int | 32 |
num_attention_heads | int | 16 |
feedforward_dim | int | 5120 |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
device | device \| str \| None | The PyTorch device to use. | None |
dtype | dtype \| None | The PyTorch data type to use. | None |
Source code in src/refiners/foundationals/clip/image_encoder.py
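CLIPImageEncoderG and CLIPImageEncoderH only override the hyperparameters listed above. A sketch comparing the two (the parameter counts are computed at runtime, and the safetensors path is a hypothetical placeholder for weights you have converted yourself):

```python
from refiners.foundationals.clip.image_encoder import CLIPImageEncoderG, CLIPImageEncoderH

# Both variants use 14x14 patches; G is wider (1664) and deeper (48 layers)
# than H (1280 wide, 32 layers). Note these are large models: instantiating
# them allocates the full architectures with random weights.
encoder_g = CLIPImageEncoderG()
encoder_h = CLIPImageEncoderH()

print("G parameters:", sum(p.numel() for p in encoder_g.parameters()))
print("H parameters:", sum(p.numel() for p in encoder_h.parameters()))

# To use pretrained weights, load a converted checkpoint (hypothetical path):
# encoder_h.load_from_safetensors("CLIPImageEncoderH.safetensors")
```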
CLIPTextEncoder
CLIPTextEncoder(
embedding_dim: int = 768,
max_sequence_length: int = 77,
vocabulary_size: int = 49408,
num_layers: int = 12,
num_attention_heads: int = 12,
feedforward_dim: int = 3072,
layer_norm_eps: float = 1e-05,
use_quick_gelu: bool = False,
tokenizer: CLIPTokenizer | None = None,
device: device | str | None = None,
dtype: dtype | None = None,
)
Bases: Chain
Contrastive Language-Image Pretraining (CLIP) text encoder.
See [arXiv:2103.00020] Learning Transferable Visual Models From Natural Language Supervision for more details.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
embedding_dim | int | The embedding dimension. | 768 |
max_sequence_length | int | The maximum sequence length. | 77 |
vocabulary_size | int | The vocabulary size. | 49408 |
num_layers | int | The number of layers. | 12 |
num_attention_heads | int | The number of attention heads. | 12 |
feedforward_dim | int | The feedforward dimension. | 3072 |
layer_norm_eps | float | The epsilon value for layer normalization. | 1e-05 |
use_quick_gelu | bool | Whether to use the quick GeLU activation function. | False |
tokenizer | CLIPTokenizer \| None | The tokenizer. | None |
device | device \| str \| None | The PyTorch device to use. | None |
dtype | dtype \| None | The PyTorch data type to use. | None |
Source code in src/refiners/foundationals/clip/text_encoder.py
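A minimal sketch with random weights; it assumes, as the tokenizer parameter above suggests, that tokenization happens inside the chain so the encoder can be called directly on a prompt string:

```python
import torch

from refiners.foundationals.clip.text_encoder import CLIPTextEncoder

text_encoder = CLIPTextEncoder()

with torch.no_grad():
    embedding = text_encoder("a photograph of an astronaut riding a horse")

# With the defaults above, the output should have shape
# (1, max_sequence_length, embedding_dim), i.e. (1, 77, 768).
print(embedding.shape)
```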
CLIPTextEncoderG
¶
Bases: CLIPTextEncoder
CLIP giant text encoder.
See [arXiv:2103.00020] Learning Transferable Visual Models From Natural Language Supervision for more details.
Attributes:
Name | Type | Value |
---|---|---|
embedding_dim | int | 1280 |
num_layers | int | 32 |
num_attention_heads | int | 20 |
feedforward_dim | int | 5120 |
tokenizer | CLIPTokenizer | CLIPTokenizer(pad_token_id=0) |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
device | device \| str \| None | The PyTorch device to use. | None |
dtype | dtype \| None | The PyTorch data type to use. | None |
Source code in src/refiners/foundationals/clip/text_encoder.py
CLIPTextEncoderH
Bases: CLIPTextEncoder
CLIP huge text encoder.
See [arXiv:2103.00020] Learning Transferable Visual Models From Natural Language Supervision for more details.
Attributes:
Name | Type | Value |
---|---|---|
embedding_dim | int | 1024 |
num_layers | int | 23 |
num_attention_heads | int | 16 |
feedforward_dim | int | 4096 |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
device | device \| str \| None | The PyTorch device to use. | None |
dtype | dtype \| None | The PyTorch data type to use. | None |
Source code in src/refiners/foundationals/clip/text_encoder.py
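The G and H text encoders differ only in the hyperparameters listed above; the sketch below follows the same calling convention as the base-class sketch and assumes the embedding widths from the attribute tables (1280 for G, 1024 for H):

```python
import torch

from refiners.foundationals.clip.text_encoder import CLIPTextEncoderG, CLIPTextEncoderH

prompt = "a red bicycle leaning against a wall"

with torch.no_grad():
    g_embedding = CLIPTextEncoderG()(prompt)
    h_embedding = CLIPTextEncoderH()(prompt)

# Expected last dimensions: 1280 (G) and 1024 (H), per the attribute tables.
print(g_embedding.shape, h_embedding.shape)
```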
CLIPTextEncoderL
Bases: CLIPTextEncoder
CLIP large text encoder.
Note
We replace the GeLU activation function with an approximate (quick) GeLU to match the original OpenAI CLIP implementation (https://github.com/openai/CLIP/blob/a1d0717/clip/model.py#L166).
See [arXiv:2103.00020] Learning Transferable Visual Models From Natural Language Supervision for more details.
Attributes:
Name | Type | Value |
---|---|---|
embedding_dim | int | 768 |
num_layers | int | 12 |
num_attention_heads | int | 12 |
feedforward_dim | int | 3072 |
use_quick_gelu | bool | True |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
device | device \| str \| None | The PyTorch device to use. | None |
dtype | dtype \| None | The PyTorch data type to use. | None |
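Given the attribute values listed above, CLIPTextEncoderL should behave like a base CLIPTextEncoder configured by hand with quick GeLU enabled; a sketch of that assumed equivalence:

```python
from refiners.foundationals.clip.text_encoder import CLIPTextEncoder, CLIPTextEncoderL

encoder_l = CLIPTextEncoderL()

# The same configuration spelled out on the base class, including the
# approximate (quick) GeLU used by the original OpenAI implementation.
equivalent = CLIPTextEncoder(
    embedding_dim=768,
    num_layers=12,
    num_attention_heads=12,
    feedforward_dim=3072,
    use_quick_gelu=True,
)

# Both should build the same architecture; the weights differ because each
# instance is randomly initialized.
```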