Add composer model class for running with precomputed CLIP and T5 text latents #171
base: main
Conversation
left some comments!
@@ -140,6 +143,15 @@ def __getitem__(self, index):
    if 'CLIP_LATENTS' in latent_key:
        clip_pooled = np.frombuffer(sample[f'{caption_key}_CLIP_POOLED_TEXT'], dtype=np.float32).copy()
        out['CLIP_POOLED'] = torch.from_numpy(clip_pooled).to(self.latent_dtype).reshape(latent_shape[1])
    if self.drop_nans:
        if out['CLIP_POOLED'].isnan().any():
this will work for us, but this section will fail if the text latent keys are named differently - probably safer to loop through self.text_latent_keys and self.attention_mask_keys, e.g. something like:
if self.drop_nans:
for latent_key, attn_key in zip(self.text_latent_keys, self.attention_mask_keys):
if out[latent_key].isnan().any():
out[latent_key] = torch.zeros_like(out[latent_key])
out[attn_key] = torch.zeros_like(out[attn_key])
if 'CLIP_LATENTS' in latent_key:
out['CLIP_POOLED'] = torch.zeros_like(out['CLIP_POOLED'])
@@ -121,8 +121,9 @@ def __init__(self,
    clip_attention_mask = clip_attention_mask.cpu().to(torch.long)

    latent_batch['T5_LATENTS'] = t5_latents
    latent_batch['T5_ATTENTION_MASK'] = t5_attention_mask
ideally these key names wouldn't be hardcoded and would instead be grabbed from the dataset class, but I think this is OK for now
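To illustrate the suggestion above, here is a minimal sketch of pulling the key names from the dataset's `text_latent_keys` / `attention_mask_keys` attributes instead of hardcoding `'T5_LATENTS'` and `'T5_ATTENTION_MASK'`. The helper name and its arguments are hypothetical, not part of the PR:

```python
def build_latent_batch(dataset, t5_latents, t5_attention_mask):
    """Assemble the latent batch using key names taken from the dataset.

    Assumes the dataset exposes text_latent_keys and attention_mask_keys,
    as in the dataloader comment earlier in this review.
    """
    latent_batch = {}
    # Find the T5 entries by substring match rather than hardcoding the names.
    t5_latent_key = next(k for k in dataset.text_latent_keys if 'T5' in k)
    t5_mask_key = next(k for k in dataset.attention_mask_keys if 'T5' in k)
    latent_batch[t5_latent_key] = t5_latents
    latent_batch[t5_mask_key] = t5_attention_mask
    return latent_batch
```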
                              cache_dir=cache_dir,
                              local_files_only=False)
t5_encoder = AutoModel.from_pretrained('google/t5-v1_1-xxl',
                                       torch_dtype=torch.float16,
hmm, should this be bfloat16? or read the dtype from an argument?
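One way to act on this suggestion is to resolve the encoder dtype from a string argument rather than hardcoding `torch.float16`. This is only a sketch; the function name, default, and supported set are assumptions, not code from the PR:

```python
import torch

# Map user-facing dtype strings to torch dtypes.
DTYPE_MAP = {
    'float16': torch.float16,
    'bfloat16': torch.bfloat16,
    'float32': torch.float32,
}

def resolve_dtype(name: str = 'bfloat16') -> torch.dtype:
    """Return the torch dtype for a string argument, defaulting to bfloat16."""
    if name not in DTYPE_MAP:
        raise ValueError(f'Unsupported dtype {name!r}; expected one of {sorted(DTYPE_MAP)}')
    return DTYPE_MAP[name]
```

The `from_pretrained` calls could then take `torch_dtype=resolve_dtype(encoder_dtype)` so both the T5 and CLIP encoders pick up the same configurable precision.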
                                       local_files_only=False).encoder.eval()
clip_encoder = CLIPTextModel.from_pretrained('stabilityai/stable-diffusion-xl-base-1.0',
                                             subfolder='text_encoder',
                                             torch_dtype=torch.float16,
same comment about bfloat16
latents = self.encode_images(inputs)

# Text embeddings are shape (B, seq_len, emb_dim), optionally truncate to a max length
t5_embed = batch['T5_LATENTS']
just to keep my comments consistent with what I said for the dataloader: probably safest to grab these key names from an argument, but this is totally fine for our use cases
This PR adds a composer model class for running with precomputed text latents. It is largely similar to the SDXL model, except that latents from the different text encoders are projected to a common embedding dimension and then concatenated along the sequence dimension, rather than along the feature dimension.
This builds off of #152, which can now be closed.
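The concatenation scheme described above can be sketched as follows. The projection layers, sequence lengths, and embedding dimensions here are illustrative assumptions, not the actual model code:

```python
import torch

# Illustrative shapes: CLIP and T5 latents with different feature dims.
B, clip_len, t5_len = 2, 77, 256
clip_dim, t5_dim, model_dim = 1280, 4096, 1024

# Project each encoder's latents to a shared embedding dimension.
clip_proj = torch.nn.Linear(clip_dim, model_dim)
t5_proj = torch.nn.Linear(t5_dim, model_dim)

clip_latents = torch.randn(B, clip_len, clip_dim)
t5_latents = torch.randn(B, t5_len, t5_dim)

# Concatenate along dim=1 (sequence), unlike SDXL which concatenates
# along the feature dimension.
text_embeds = torch.cat([clip_proj(clip_latents), t5_proj(t5_latents)], dim=1)
# text_embeds has shape (B, clip_len + t5_len, model_dim)
```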