It is implicitly defined by the underlying cross-attention layer. This also makes it consistent with SDXL.