
Add composer model class for running with precomputed CLIP and T5 text latents #171

Open · wants to merge 24 commits into main

Conversation

coryMosaicML (Collaborator)

This PR adds a model class for running with precomputed text latents. It is largely similar to the SDXL model, with the exception of concatenating latents from the different text encoders along a sequence dimension rather than a feature dimension after projecting to a common embedding dimension.

This builds off of #152, which can now be closed.
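
A minimal sketch of the conditioning strategy described above (illustrative dimensions and module names, not the PR's actual code): each encoder's latents are projected to a shared embedding dimension and then concatenated along the sequence dimension, rather than along the feature dimension as in SDXL.

import torch
import torch.nn as nn

# Hypothetical dimensions: CLIP hidden size, T5-XXL hidden size, shared conditioning size.
clip_dim, t5_dim, cond_dim = 1280, 4096, 2048

clip_proj = nn.Linear(clip_dim, cond_dim)
t5_proj = nn.Linear(t5_dim, cond_dim)

clip_latents = torch.randn(2, 77, clip_dim)  # (B, clip_seq_len, clip_dim)
t5_latents = torch.randn(2, 256, t5_dim)     # (B, t5_seq_len, t5_dim)

# After projection to the common embedding dim, stack the two sequences along
# dim=1; SDXL instead concatenates the encoders' features along the last dim.
text_embeds = torch.cat([clip_proj(clip_latents), t5_proj(t5_latents)], dim=1)
print(text_embeds.shape)  # torch.Size([2, 333, 2048])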

@jazcollins (Contributor) left a comment


left some comments!

@@ -140,6 +143,15 @@ def __getitem__(self, index):
if 'CLIP_LATENTS' in latent_key:
clip_pooled = np.frombuffer(sample[f'{caption_key}_CLIP_POOLED_TEXT'], dtype=np.float32).copy()
out['CLIP_POOLED'] = torch.from_numpy(clip_pooled).to(self.latent_dtype).reshape(latent_shape[1])
if self.drop_nans:
if out['CLIP_POOLED'].isnan().any():
jazcollins (Contributor):

this will work for us, but this section will fail if the text latent keys are named differently - probably safer to loop through self.text_latent_keys and self.attention_mask_keys, e.g. something like:

if self.drop_nans:
    for latent_key, attn_key in zip(self.text_latent_keys, self.attention_mask_keys):
        if out[latent_key].isnan().any():
            out[latent_key] = torch.zeros_like(out[latent_key])
            out[attn_key] = torch.zeros_like(out[attn_key])
        if 'CLIP_LATENTS' in latent_key and out['CLIP_POOLED'].isnan().any():
            out['CLIP_POOLED'] = torch.zeros_like(out['CLIP_POOLED'])

@@ -121,8 +121,9 @@ def __init__(self,
clip_attention_mask = clip_attention_mask.cpu().to(torch.long)

latent_batch['T5_LATENTS'] = t5_latents
latent_batch['T5_ATTENTION_MASK'] = t5_attention_mask
jazcollins (Contributor):

ideally these key names wouldn't be hardcoded and would be grabbed from the dataset class, but I think this is OK for now
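
A rough sketch of that suggestion, assuming the dataset exposes text_latent_keys and attention_mask_keys attributes (the key lists and dummy tensors below are illustrative stand-ins):

import torch

# Key lists as the dataset class would expose them, e.g. dataset.text_latent_keys
# and dataset.attention_mask_keys; hardcoded here only for the demo.
text_latent_keys = ['T5_LATENTS', 'CLIP_LATENTS']
attention_mask_keys = ['T5_ATTENTION_MASK', 'CLIP_ATTENTION_MASK']

# Dummy precomputed latents and masks standing in for the real encoder outputs.
latents = {'T5_LATENTS': torch.randn(2, 256, 4096), 'CLIP_LATENTS': torch.randn(2, 77, 1280)}
masks = {'T5_ATTENTION_MASK': torch.ones(2, 256, dtype=torch.long),
         'CLIP_ATTENTION_MASK': torch.ones(2, 77, dtype=torch.long)}

# Build the batch by iterating over the dataset-provided key names instead of
# assigning latent_batch['T5_LATENTS'] = ... with literal strings.
latent_batch = {}
for latent_key, mask_key in zip(text_latent_keys, attention_mask_keys):
    latent_batch[latent_key] = latents[latent_key]
    latent_batch[mask_key] = masks[mask_key]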

cache_dir=cache_dir,
local_files_only=False)
t5_encoder = AutoModel.from_pretrained('google/t5-v1_1-xxl',
torch_dtype=torch.float16,
jazcollins (Contributor):

hmm should be bfloat16? or read the dtype from an argument?
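
One way to read the dtype from an argument (a sketch; the encoder_dtype argument name and its bfloat16 default are assumptions, not the PR's API):

import torch
from transformers import AutoModel

def load_t5_encoder(encoder_dtype: str = 'bfloat16', cache_dir: str = None):
    # Resolve the dtype from the argument instead of hardcoding torch.float16.
    return AutoModel.from_pretrained('google/t5-v1_1-xxl',
                                     torch_dtype=getattr(torch, encoder_dtype),
                                     cache_dir=cache_dir,
                                     local_files_only=False).encoder.eval()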

local_files_only=False).encoder.eval()
clip_encoder = CLIPTextModel.from_pretrained('stabilityai/stable-diffusion-xl-base-1.0',
subfolder='text_encoder',
torch_dtype=torch.float16,
jazcollins (Contributor):

same comment about bfloat

latents = self.encode_images(inputs)

# Text embeddings are shape (B, seq_len, emb_dim), optionally truncate to a max length
t5_embed = batch['T5_LATENTS']
jazcollins (Contributor):

Just to keep my comments consistent with what I said for the dataloader: it's probably safest to grab these key names from an argument, but this is totally fine for our use cases.
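
A hypothetical sketch of that suggestion on the model side: take the key names as a constructor argument and index the batch with them (class and attribute names here are illustrative, not the PR's):

import torch

class PrecomputedTextLatentsModel:
    """Illustrative stand-in for the composer model class; only the key lookup is shown."""

    def __init__(self, text_latent_keys=('T5_LATENTS', 'CLIP_LATENTS')):
        self.text_latent_keys = text_latent_keys

    def get_text_embeds(self, batch):
        # Index the batch with configurable key names rather than literal strings.
        t5_key, clip_key = self.text_latent_keys
        return batch[t5_key], batch[clip_key]

model = PrecomputedTextLatentsModel()
batch = {'T5_LATENTS': torch.randn(2, 256, 4096), 'CLIP_LATENTS': torch.randn(2, 77, 1280)}
t5_embed, clip_embed = model.get_text_embeds(batch)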
