Add composer model class for running with precomputed CLIP and T5 text latents #171
base: main
Conversation
left some comments!
@@ -140,6 +143,15 @@ def __getitem__(self, index):
    if 'CLIP_LATENTS' in latent_key:
        clip_pooled = np.frombuffer(sample[f'{caption_key}_CLIP_POOLED_TEXT'], dtype=np.float32).copy()
        out['CLIP_POOLED'] = torch.from_numpy(clip_pooled).to(self.latent_dtype).reshape(latent_shape[1])
    if self.drop_nans:
        if out['CLIP_POOLED'].isnan().any():
this will work for us, but this section will fail if the text latent keys are named differently - probably safer to loop through self.text_latent_keys and self.attention_mask_keys, e.g. something like:
if self.drop_nans:
for latent_key, attn_key in zip(self.text_latent_keys, self.attention_mask_keys):
if out[latent_key].isnan().any():
out[latent_key] = torch.zeros_like(out[latent_key])
out[attn_key] = torch.zeros_like(out[attn_key])
if 'CLIP_LATENTS' in latent_key:
out['CLIP_POOLED'] = torch.zeros_like(out['CLIP_POOLED'])
@@ -121,8 +121,9 @@ def __init__(self,
    clip_attention_mask = clip_attention_mask.cpu().to(torch.long)

    latent_batch['T5_LATENTS'] = t5_latents
    latent_batch['T5_ATTENTION_MASK'] = t5_attention_mask
ideally these key names wouldn't be hardcoded and would instead be grabbed from the dataset class, but I think this is OK for now
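To illustrate the suggestion above, here is a minimal sketch of pulling the key names from the dataset's `text_latent_keys` / `attention_mask_keys` attributes instead of hardcoding `'T5_LATENTS'` and `'T5_ATTENTION_MASK'`. The helper name and its arguments are hypothetical, not part of the PR:

```python
def build_latent_batch(dataset, t5_latents, t5_attention_mask):
    """Assemble the latent batch using key names taken from the dataset.

    Assumes the dataset exposes text_latent_keys and attention_mask_keys,
    as in the dataloader comment earlier in this review.
    """
    latent_batch = {}
    # Find the T5 entries by substring match rather than hardcoding the names.
    t5_latent_key = next(k for k in dataset.text_latent_keys if 'T5' in k)
    t5_mask_key = next(k for k in dataset.attention_mask_keys if 'T5' in k)
    latent_batch[t5_latent_key] = t5_latents
    latent_batch[t5_mask_key] = t5_attention_mask
    return latent_batch
```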
                              cache_dir=cache_dir,
                              local_files_only=False)
t5_encoder = AutoModel.from_pretrained('google/t5-v1_1-xxl',
                                       torch_dtype=torch.float16,
hmm, should this be bfloat16? or read the dtype from an argument?
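One way to act on this suggestion is to resolve the encoder dtype from a string argument rather than hardcoding `torch.float16`. This is only a sketch; the function name, default, and supported set are assumptions, not code from the PR:

```python
import torch

# Map user-facing dtype strings to torch dtypes.
DTYPE_MAP = {
    'float16': torch.float16,
    'bfloat16': torch.bfloat16,
    'float32': torch.float32,
}

def resolve_dtype(name: str = 'bfloat16') -> torch.dtype:
    """Return the torch dtype for a string argument, defaulting to bfloat16."""
    if name not in DTYPE_MAP:
        raise ValueError(f'Unsupported dtype {name!r}; expected one of {sorted(DTYPE_MAP)}')
    return DTYPE_MAP[name]
```

The `from_pretrained` calls could then take `torch_dtype=resolve_dtype(encoder_dtype)` so both the T5 and CLIP encoders pick up the same configurable precision.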
                                       local_files_only=False).encoder.eval()
clip_encoder = CLIPTextModel.from_pretrained('stabilityai/stable-diffusion-xl-base-1.0',
                                             subfolder='text_encoder',
                                             torch_dtype=torch.float16,
same comment about bfloat16
latents = self.encode_images(inputs)

# Text embeddings are shape (B, seq_len, emb_dim), optionally truncate to a max length
t5_embed = batch['T5_LATENTS']
just to keep my comments consistent with what I said for the dataloader: probably safest to grab these key names from an argument, but this is totally fine for our use cases
This PR adds a composer model class for running with precomputed text latents. It is largely similar to the SDXL model, except that latents from the different text encoders are projected to a common embedding dimension and then concatenated along the sequence dimension, rather than along the feature dimension.
This builds off of #152, which can now be closed.
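The concatenation scheme described above can be sketched as follows. The projection layers, sequence lengths, and embedding dimensions here are illustrative assumptions, not the actual model code:

```python
import torch

# Illustrative shapes: CLIP and T5 latents with different feature dims.
B, clip_len, t5_len = 2, 77, 256
clip_dim, t5_dim, model_dim = 1280, 4096, 1024

# Project each encoder's latents to a shared embedding dimension.
clip_proj = torch.nn.Linear(clip_dim, model_dim)
t5_proj = torch.nn.Linear(t5_dim, model_dim)

clip_latents = torch.randn(B, clip_len, clip_dim)
t5_latents = torch.randn(B, t5_len, t5_dim)

# Concatenate along dim=1 (sequence), unlike SDXL which concatenates
# along the feature dimension.
text_embeds = torch.cat([clip_proj(clip_latents), t5_proj(t5_latents)], dim=1)
# text_embeds has shape (B, clip_len + t5_len, model_dim)
```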