
Commit

added tokenizer content to scope
sariola committed Aug 22, 2024
1 parent 41425fe commit 2690243
Showing 3 changed files with 203 additions and 18 deletions.
193 changes: 190 additions & 3 deletions docs/Writerside/topics/Engineering-Scope-Problems-Solved.topic
@@ -15,6 +15,7 @@
<chapter title="Specs" id="0chapter0">

<chapter title="Loading and validating the configuration" id="loading-and-validating-the-configuration">
This chapter covers how a merge configuration is loaded and how its fields are validated before any merging work begins.


<procedure title="Validating yaml fields"
@@ -58,8 +59,9 @@
</procedure>

</chapter>

<br/>
<chapter title="Loading models" id="loading-models">
This chapter covers how the ingredient models referenced in the configuration are loaded.

<procedure title="Loading models from local" id="loading_models_from_local" type="choices" collapsible="true" default-state="collapsed">
<step>
@@ -99,10 +101,195 @@

</chapter>

<br/>
<chapter title="Tokenizer" id="tokenizer">
This chapter covers how the tokenizers of the ingredient models are reconciled: first by detecting differences between them, then by constructing a common tokenizer for the output model.

<procedure title="Finding differences in the tokenizers of the ingredient models"
id="finding-differences-in-tokenizers"
type="choices"
collapsible="true"
default-state="collapsed">
<step>
<p><b>Problem</b></p>
When we merge multiple models, they may have differing tokenizers and vocabularies.
To produce a good final model, we want to preserve the token encodings that the
models were trained with as faithfully as possible.
For this reason we produce a union of the tokens to be used.
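<p>As a minimal sketch of the idea (hypothetical model ids; not the library's implementation), a union vocabulary can be formed from the token sets of two tokenizers:</p>
<code-block lang="python">
from transformers import AutoTokenizer

# Hypothetical model ids, for illustration only
tok_a = AutoTokenizer.from_pretrained("org/model-a")
tok_b = AutoTokenizer.from_pretrained("org/model-b")

# Union of the two token sets
union_tokens = set(tok_a.get_vocab()) | set(tok_b.get_vocab())

# Tokens known to model B but not to model A would need to be added to A
missing_in_a = union_tokens - set(tok_a.get_vocab())
tok_a.add_tokens(sorted(missing_in_a))
</code-block>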
</step>
<step>
<p><b>Constraints & caveats</b></p>
The comparison is strict: vocabularies are compared as exact token-to-ID mappings, so two tokenizers that share the same tokens but assign them different IDs are still treated as different. The checks are also run pairwise over every combination of models, so a single mismatch anywhere marks the whole set as differing.
</step>
<step>
<p><b>Functions used</b></p>
<p><u>Step 1</u><br/> Loading all of the tokenizers with the <a href="https://github.com/flowritecom/flow-merge/blob/41425fecbb93aba9faaa703d726bf819913ee076/flow_merge/lib/tokenizer.py#L22"><b>load_all_tokenizers</b></a> method of the TokenizerLoader class.</p>
<code-block lang="python">
import logging
from typing import Dict, List

from transformers import AutoTokenizer, PreTrainedTokenizerBase

# ApplicationConfig is flow-merge's own configuration class
class TokenizerLoader:
    @staticmethod
    def load_all_tokenizers(
        models_ids: List[str], config: ApplicationConfig
    ) -> Dict[str, PreTrainedTokenizerBase]:
        all_tokenizers = {}
        for model_id in models_ids:
            try:
                tokenizer = AutoTokenizer.from_pretrained(
                    model_id,
                    trust_remote_code=config.trust_remote_code,
                )
            except Exception as e:
                error_message = f"Error loading tokenizer for {model_id}: {e}"
                logging.error(error_message)
                raise RuntimeError(error_message)
            all_tokenizers[model_id] = tokenizer
        return all_tokenizers
</code-block>

<br/><p><u>Step 2</u><br/>Checking for differences between the tokenizers of the models to be merged.<br/>
<br/>We check three different things:<br/>
1. vocabularies<br/>
2. special tokens<br/>
3. added tokens encoders<br/><br/>

Firstly, we attempt to find <b>differences in vocabularies</b>, i.e. in the mapping from token strings to token IDs that each model was trained with. </p>
<code-block lang="python">
@staticmethod
def _compare_tokenizer_vocabs(
    model_a: str,
    tokenizer_a: PreTrainedTokenizerBase,
    model_b: str,
    tokenizer_b: PreTrainedTokenizerBase,
) -> bool:
    vocab_a = tokenizer_a.get_vocab()
    vocab_b = tokenizer_b.get_vocab()

    if vocab_a != vocab_b:
        logging.info(
            f"Tokenizer for model {model_a} has different vocab compared to model {model_b}."
        )
        return True
    return False
</code-block>
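<br/><p>As a toy illustration (hypothetical values): the comparison is over the full token-to-ID dict, so identical token sets with different IDs still count as a difference.</p>
<code-block lang="python">
# Hypothetical toy vocabularies: same tokens, different ids
vocab_a = {"hello": 0, "world": 1}
vocab_b = {"hello": 1, "world": 0}
assert vocab_a != vocab_b  # flagged as a difference by the strict comparison
</code-block>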
<br/><p>
Secondly, we <b>compare the special tokens</b>: the reserved tokens (beginning-of-sequence, end-of-sequence, padding, unknown, and so on) recorded in each tokenizer's special_tokens_map. </p>
<code-block lang="python">
@staticmethod
def _compare_special_tokens(
    model_a: str,
    tokenizer_a: PreTrainedTokenizerBase,
    model_b: str,
    tokenizer_b: PreTrainedTokenizerBase,
) -> bool:
    special_tokens_a = tokenizer_a.special_tokens_map
    special_tokens_b = tokenizer_b.special_tokens_map

    if special_tokens_a != special_tokens_b:
        logging.info(
            f"Tokenizer for model {model_a} has different special tokens compared to model {model_b}."
        )
        return True
    return False
</code-block>
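<br/><p>For intuition, a special_tokens_map typically looks like the following (hypothetical values; the actual tokens vary per model):</p>
<code-block lang="python">
# Hypothetical example of a special_tokens_map
special_tokens = {
    "bos_token": "[BOS]",
    "eos_token": "[EOS]",
    "unk_token": "[UNK]",
    "pad_token": "[PAD]",
}
</code-block>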

<br/><p>Thirdly, we <b>compare added tokens encoders</b>, which map the tokens added on top of the base vocabulary (for example via add_tokens) to their IDs.</p>
<code-block lang="python">
@staticmethod
def _compare_added_tokens_encoders(
    model_a: str,
    tokenizer_a: PreTrainedTokenizerBase,
    model_b: str,
    tokenizer_b: PreTrainedTokenizerBase,
) -> bool:
    added_tokens_encoder_a = tokenizer_a.added_tokens_encoder
    added_tokens_encoder_b = tokenizer_b.added_tokens_encoder

    if added_tokens_encoder_a != added_tokens_encoder_b:
        logging.info(
            f"Tokenizer for model {model_a} has different added tokens encoder compared to model {model_b}."
        )
        return True
    return False
</code-block>
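<br/><p>A small runnable sketch of how added_tokens_encoder gets populated (hypothetical model id and token):</p>
<code-block lang="python">
from transformers import AutoTokenizer

# Hypothetical model id, for illustration only
tok = AutoTokenizer.from_pretrained("org/model-a")
tok.add_tokens(["[CUSTOM]"])
# added_tokens_encoder now maps the new token to a fresh id
# beyond the base vocabulary, e.g. {"[CUSTOM]": 32000}
print(tok.added_tokens_encoder)
</code-block>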
<br/><p>And finally, we put everything together by combining the three checks with a logical <b>OR</b>: if even one of them reports a mismatch, we record that the tokenizers differ,
and in the next steps we proceed to merge a common tokenizer to eliminate that difference.</p>
<code-block lang="python">
from itertools import combinations

# Defined on the TokenizerValidator class, alongside the _compare_* helpers above
@staticmethod
def check_tokenizers_for_differences(
    tokenizers: Dict[str, PreTrainedTokenizerBase]
) -> bool:
    differences_found = False

    # Compare every pair of tokenizers; any single mismatch flags the set
    for (model_a, tokenizer_a), (model_b, tokenizer_b) in combinations(
        tokenizers.items(), 2
    ):
        differences_found |= TokenizerValidator._compare_tokenizer_vocabs(
            model_a, tokenizer_a, model_b, tokenizer_b
        )
        differences_found |= TokenizerValidator._compare_special_tokens(
            model_a, tokenizer_a, model_b, tokenizer_b
        )
        differences_found |= TokenizerValidator._compare_added_tokens_encoders(
            model_a, tokenizer_a, model_b, tokenizer_b
        )
    return differences_found
</code-block>
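<br/><p>Putting the two pieces together, a hypothetical end-to-end usage sketch (placeholder model ids and config):</p>
<code-block lang="python">
# Hypothetical ids and config, for illustration only
tokenizers = TokenizerLoader.load_all_tokenizers(
    ["org/model-a", "org/model-b"], config
)
if TokenizerValidator.check_tokenizers_for_differences(tokenizers):
    # proceed to build a common tokenizer (next procedure)
    ...
</code-block>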
</step>
</procedure>
<procedure title="Creating a common tokenizer for the output model"
id="creating-common-tokenizer"
type="choices"
collapsible="true"
default-state="collapsed">
<step>
<p><b>Problem</b></p>
When the previous checks find differences between the ingredient tokenizers, the output model needs a single tokenizer that covers all of them. We therefore construct a common tokenizer, based on the union of the tokens, for the output model.
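<p>A minimal sketch of the general approach (not the library's implementation), using standard transformers APIs and hypothetical model ids:</p>
<code-block lang="python">
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical ids, for illustration only
base_tok = AutoTokenizer.from_pretrained("org/model-a")
other_tok = AutoTokenizer.from_pretrained("org/model-b")

# Extend the base tokenizer with tokens it is missing
missing = set(other_tok.get_vocab()) - set(base_tok.get_vocab())
base_tok.add_tokens(sorted(missing))

# The merged model's embeddings must be resized to the new vocab size
model = AutoModelForCausalLM.from_pretrained("org/model-a")
model.resize_token_embeddings(len(base_tok))
</code-block>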
</step>
<step>
<p><b>Constraints & caveats</b></p>
Extending the vocabulary changes the token-to-ID mapping, so the merged model's embedding and output layers must be resized and re-aligned to the new vocabulary; tokens that exist in only some of the ingredient models have no trained embedding in the others.
</step>
<step>
<p><b>Functions used</b></p>
<p><u>Step 1</u><br/> Explain here how the union vocabulary is assembled and the common tokenizer is constructed.</p>
</step>
</procedure>
</chapter>
<br/>
<chapter title="Adapters" id="adapters">

<procedure title="Loading adapter from Hugging Face" id="loading_adapter_from_hugging_face" type="choices" collapsible="true" default-state="collapsed">
<procedure title="Loading adapter from Hugging Face" id="loading_adapter_from_hugging_face"
type="choices" collapsible="true" default-state="collapsed">
<step>
<p><b>Problem</b></p>
</step>
2 changes: 0 additions & 2 deletions docs/Writerside/topics/Load.topic
@@ -4,11 +4,9 @@
<topic xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="https://resources.jetbrains.com/writerside/1.0/topic.v2.xsd"
title="Loading a merge configuration" id="Load">

<p>
This section explains how we load merging tasks and their related settings.
</p>

<chapter title="Initialize a merge plan from configuration" id="read-merge-configuration">
<tabs>
<tab title="Read from yaml file">
26 changes: 13 additions & 13 deletions flake.lock

Some generated files are not rendered by default.
