
Commit

added tokenizer content to scope
sariola committed Aug 22, 2024
1 parent 41425fe commit 2690243
Showing 3 changed files with 203 additions and 18 deletions.
193 changes: 190 additions & 3 deletions docs/Writerside/topics/Engineering-Scope-Problems-Solved.topic
@@ -15,6 +15,7 @@
<chapter title="Specs" id="0chapter0">

<chapter title="Loading and validating the configuration" id="loading-and-validating-the-configuration">
This chapter covers how a merge configuration is loaded and how its fields are validated before any merging work begins.


<procedure title="Validating yaml fields"
@@ -58,8 +59,9 @@
</procedure>

</chapter>

<br/>
<chapter title="Loading models" id="loading-models">
This chapter covers how the ingredient models referenced in the configuration are loaded.

<procedure title="Loading models from local" id="loading_models_from_local" type="choices" collapsible="true" default-state="collapsed">
<step>
@@ -99,10 +101,195 @@

</chapter>

<br/>
<chapter title="Tokenizer" id="tokenizer">
This chapter covers how the tokenizers of the ingredient models are reconciled: first by detecting differences between them, then by constructing a common tokenizer for the output model.

<procedure title="Finding differences in the tokenizers of the ingredient models"
id="finding-differences-in-tokenizers"
type="choices"
collapsible="true"
default-state="collapsed">
<step>
<p><b>Problem</b></p>
When we merge multiple models, they may have differing tokenizers and vocabularies.
To produce a good final model, we want to preserve the token encodings that the
models were trained with as faithfully as possible.
For this reason we produce a union of the tokens to be used.
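<p>As a minimal sketch of the idea (hypothetical model ids; not the library's implementation), a union vocabulary can be formed from the token sets of two tokenizers:</p>
<code-block lang="python">
from transformers import AutoTokenizer

# Hypothetical model ids, for illustration only
tok_a = AutoTokenizer.from_pretrained("org/model-a")
tok_b = AutoTokenizer.from_pretrained("org/model-b")

# Union of the two token sets
union_tokens = set(tok_a.get_vocab()) | set(tok_b.get_vocab())

# Tokens known to model B but not to model A would need to be added to A
missing_in_a = union_tokens - set(tok_a.get_vocab())
tok_a.add_tokens(sorted(missing_in_a))
</code-block>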
</step>
<step>
<p><b>Constraints & caveats</b></p>
The comparison is strict: vocabularies are compared as exact token-to-ID mappings, so two tokenizers that share the same tokens but assign them different IDs are still treated as different. The checks are also run pairwise over every combination of models, so a single mismatch anywhere marks the whole set as differing.
</step>
<step>
<p><b>Functions used</b></p>
<p><u>Step 1</u><br/> Loading all of the tokenizers with the <a href="https://github.com/flowritecom/flow-merge/blob/41425fecbb93aba9faaa703d726bf819913ee076/flow_merge/lib/tokenizer.py#L22"><b>load_all_tokenizers</b></a> method of the TokenizerLoader class.</p>
<code-block lang="python">
import logging
from typing import Dict, List

from transformers import AutoTokenizer, PreTrainedTokenizerBase

# ApplicationConfig is flow-merge's own configuration class
class TokenizerLoader:
    @staticmethod
    def load_all_tokenizers(
        models_ids: List[str], config: ApplicationConfig
    ) -> Dict[str, PreTrainedTokenizerBase]:
        all_tokenizers = {}
        for model_id in models_ids:
            try:
                tokenizer = AutoTokenizer.from_pretrained(
                    model_id,
                    trust_remote_code=config.trust_remote_code,
                )
            except Exception as e:
                error_message = f"Error loading tokenizer for {model_id}: {e}"
                logging.error(error_message)
                raise RuntimeError(error_message)
            all_tokenizers[model_id] = tokenizer
        return all_tokenizers
</code-block>

<br/><p><u>Step 2</u><br/>Checking for differences between the tokenizers of the models to be merged.<br/>
<br/>We check three different things:<br/>
1. vocabularies<br/>
2. special tokens<br/>
3. added tokens encoders<br/><br/>

Firstly, we attempt to find <b>differences in vocabularies</b>, i.e. in the mapping from token strings to token IDs that each model was trained with. </p>
<code-block lang="python">
@staticmethod
def _compare_tokenizer_vocabs(
    model_a: str,
    tokenizer_a: PreTrainedTokenizerBase,
    model_b: str,
    tokenizer_b: PreTrainedTokenizerBase,
) -> bool:
    vocab_a = tokenizer_a.get_vocab()
    vocab_b = tokenizer_b.get_vocab()

    if vocab_a != vocab_b:
        logging.info(
            f"Tokenizer for model {model_a} has different vocab compared to model {model_b}."
        )
        return True
    return False
</code-block>
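<br/><p>As a toy illustration (hypothetical values): the comparison is over the full token-to-ID dict, so identical token sets with different IDs still count as a difference.</p>
<code-block lang="python">
# Hypothetical toy vocabularies: same tokens, different ids
vocab_a = {"hello": 0, "world": 1}
vocab_b = {"hello": 1, "world": 0}
assert vocab_a != vocab_b  # flagged as a difference by the strict comparison
</code-block>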
<br/><p>
Secondly, we <b>compare the special tokens</b>: the reserved tokens (beginning-of-sequence, end-of-sequence, padding, unknown, and so on) recorded in each tokenizer's special_tokens_map. </p>
<code-block lang="python">
@staticmethod
def _compare_special_tokens(
    model_a: str,
    tokenizer_a: PreTrainedTokenizerBase,
    model_b: str,
    tokenizer_b: PreTrainedTokenizerBase,
) -> bool:
    special_tokens_a = tokenizer_a.special_tokens_map
    special_tokens_b = tokenizer_b.special_tokens_map

    if special_tokens_a != special_tokens_b:
        logging.info(
            f"Tokenizer for model {model_a} has different special tokens compared to model {model_b}."
        )
        return True
    return False
</code-block>
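<br/><p>For intuition, a special_tokens_map typically looks like the following (hypothetical values; the actual tokens vary per model):</p>
<code-block lang="python">
# Hypothetical example of a special_tokens_map
special_tokens = {
    "bos_token": "[BOS]",
    "eos_token": "[EOS]",
    "unk_token": "[UNK]",
    "pad_token": "[PAD]",
}
</code-block>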

<br/><p>Thirdly, we <b>compare added tokens encoders</b>, which map the tokens added on top of the base vocabulary (for example via add_tokens) to their IDs.</p>
<code-block lang="python">
@staticmethod
def _compare_added_tokens_encoders(
    model_a: str,
    tokenizer_a: PreTrainedTokenizerBase,
    model_b: str,
    tokenizer_b: PreTrainedTokenizerBase,
) -> bool:
    added_tokens_encoder_a = tokenizer_a.added_tokens_encoder
    added_tokens_encoder_b = tokenizer_b.added_tokens_encoder

    if added_tokens_encoder_a != added_tokens_encoder_b:
        logging.info(
            f"Tokenizer for model {model_a} has different added tokens encoder compared to model {model_b}."
        )
        return True
    return False
</code-block>
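<br/><p>A small runnable sketch of how added_tokens_encoder gets populated (hypothetical model id and token):</p>
<code-block lang="python">
from transformers import AutoTokenizer

# Hypothetical model id, for illustration only
tok = AutoTokenizer.from_pretrained("org/model-a")
tok.add_tokens(["[CUSTOM]"])
# added_tokens_encoder now maps the new token to a fresh id
# beyond the base vocabulary, e.g. {"[CUSTOM]": 32000}
print(tok.added_tokens_encoder)
</code-block>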
<br/><p>And finally, we put everything together by combining the three checks with a logical <b>OR</b>: if even one of them reports a mismatch, we record that the tokenizers differ,
and in the next steps we proceed to merge a common tokenizer to eliminate that difference.</p>
<code-block lang="python">
from itertools import combinations

# Defined on the TokenizerValidator class, alongside the _compare_* helpers above
@staticmethod
def check_tokenizers_for_differences(
    tokenizers: Dict[str, PreTrainedTokenizerBase]
) -> bool:
    differences_found = False

    # Compare every pair of tokenizers; any single mismatch flags the set
    for (model_a, tokenizer_a), (model_b, tokenizer_b) in combinations(
        tokenizers.items(), 2
    ):
        differences_found |= TokenizerValidator._compare_tokenizer_vocabs(
            model_a, tokenizer_a, model_b, tokenizer_b
        )
        differences_found |= TokenizerValidator._compare_special_tokens(
            model_a, tokenizer_a, model_b, tokenizer_b
        )
        differences_found |= TokenizerValidator._compare_added_tokens_encoders(
            model_a, tokenizer_a, model_b, tokenizer_b
        )
    return differences_found
</code-block>
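<br/><p>Putting the two pieces together, a hypothetical end-to-end usage sketch (placeholder model ids and config):</p>
<code-block lang="python">
# Hypothetical ids and config, for illustration only
tokenizers = TokenizerLoader.load_all_tokenizers(
    ["org/model-a", "org/model-b"], config
)
if TokenizerValidator.check_tokenizers_for_differences(tokenizers):
    # proceed to build a common tokenizer (next procedure)
    ...
</code-block>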
</step>
</procedure>
<procedure title="Creating a common tokenizer for the output model"
id="creating-common-tokenizer"
type="choices"
collapsible="true"
default-state="collapsed">
<step>
<p><b>Problem</b></p>
When the previous checks find differences between the ingredient tokenizers, the output model needs a single tokenizer that covers all of them. We therefore construct a common tokenizer, based on the union of the tokens, for the output model.
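<p>A minimal sketch of the general approach (not the library's implementation), using standard transformers APIs and hypothetical model ids:</p>
<code-block lang="python">
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical ids, for illustration only
base_tok = AutoTokenizer.from_pretrained("org/model-a")
other_tok = AutoTokenizer.from_pretrained("org/model-b")

# Extend the base tokenizer with tokens it is missing
missing = set(other_tok.get_vocab()) - set(base_tok.get_vocab())
base_tok.add_tokens(sorted(missing))

# The merged model's embeddings must be resized to the new vocab size
model = AutoModelForCausalLM.from_pretrained("org/model-a")
model.resize_token_embeddings(len(base_tok))
</code-block>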
</step>
<step>
<p><b>Constraints & caveats</b></p>
Extending the vocabulary changes the token-to-ID mapping, so the merged model's embedding and output layers must be resized and re-aligned to the new vocabulary; tokens that exist in only some of the ingredient models have no trained embedding in the others.
</step>
<step>
<p><b>Functions used</b></p>
<p><u>Step 1</u><br/> Explain here how the union vocabulary is assembled and the common tokenizer is constructed.</p>
</step>
</procedure>
</chapter>
<br/>
<chapter title="Adapters" id="adapters">

<procedure title="Loading adapter from Hugging Face" id="loading_adapter_from_hugging_face" type="choices" collapsible="true" default-state="collapsed">
<procedure title="Loading adapter from Hugging Face" id="loading_adapter_from_hugging_face"
type="choices" collapsible="true" default-state="collapsed">
<step>
<p><b>Problem</b></p>
</step>
2 changes: 0 additions & 2 deletions docs/Writerside/topics/Load.topic
@@ -4,11 +4,9 @@
<topic xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="https://resources.jetbrains.com/writerside/1.0/topic.v2.xsd"
title="Loading a merge configuration" id="Load">

<p>
This section explains how we load merging tasks and their related settings.
</p>

<chapter title="Initialize a merge plan from configuration" id="read-merge-configuration">
<tabs>
<tab title="Read from yaml file">
26 changes: 13 additions & 13 deletions flake.lock

Some generated files are not rendered by default.
