
Commit

feat: pretrained tokenizers
Hk669 committed Jun 4, 2024
1 parent 4f713b6 commit 878f9ea
Showing 2 changed files with 4 additions and 1 deletion.
README.md: 5 changes (4 additions & 1 deletion)
@@ -22,14 +22,17 @@ Every LLM(LLama, Gemini, Mistral..) use their own Tokenizers trained on their ow
- Compatible with Python 3.9 and above


#### This repository has 2 different Tokenizers:
#### This repository has 3 different Tokenizers:
- `BPETokenizer`
- `Tokenizer`
- `PreTrained`

1. [Tokenizer](bpetokenizer/base.py): This class provides `train`, `encode`, `decode`, and the `save`/`load` functionality. It also contains a few helper functions (`get_stats`, `merge`, `replace_control_characters`, ...) that implement the BPE algorithm for the tokenizer (a minimal sketch of these helpers follows this list).

2. [BPETokenizer](bpetokenizer/tokenizer.py): This class shows the real power of the tokenizer (as used in the GPT-4 tokenizer, [tiktoken](https://github.com/openai/tiktoken)). It uses the `GPT4_SPLIT_PATTERN` to split the text the same way the GPT-4 tokenizer does, and it handles `special_tokens` (refer to [sample_bpetokenizer](sample/bpetokenizer/sample_bpetokenizer.py)). It inherits the `save` and `load` functionalities to save and load the tokenizer.

3. [PreTrained Tokenizer](pretrained/wi17k_base.json): The pretrained tokenizer `wi17k_base` has a vocabulary of 17,316 tokens and 6 `special_tokens`, trained on the wikitext dataset (len: 1000000). Loading it is shown in the usage sketch after this list.
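
The BPE merge loop behind the `Tokenizer` class is easiest to see in code. Below is a minimal, self-contained sketch of what helpers like `get_stats` and `merge` typically do in a byte-pair-encoding loop; the exact signatures in `bpetokenizer/base.py` may differ.

```python
# Minimal BPE sketch; signatures are illustrative, not copied from bpetokenizer/base.py.

def get_stats(ids):
    """Count how often each adjacent pair of token ids occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with the single token `new_id`."""
    merged = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged

# One BPE training step: find the most frequent pair and merge it into a new token.
ids = list("aaabdaaabac".encode("utf-8"))
stats = get_stats(ids)
top_pair = max(stats, key=stats.get)
ids = merge(ids, top_pair, 256)  # 256 = first id beyond the raw byte range
```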

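Here is a hypothetical usage sketch tying the three pieces together, based only on the descriptions above; the import path, constructor arguments, and the path-based `load` call are assumptions, so check [sample_bpetokenizer](sample/bpetokenizer/sample_bpetokenizer.py) and the Usage section below for the actual API.

```python
# Hypothetical usage sketch; names and signatures are assumptions based on
# the descriptions above, not the verified bpetokenizer API.
from bpetokenizer import BPETokenizer  # assumed import path

# Train a small tokenizer with one example special token (assumed constructor arg).
special_tokens = {"<|endoftext|>": 1001}
tokenizer = BPETokenizer(special_tokens=special_tokens)
tokenizer.train("some training text " * 100, vocab_size=300)  # assumed signature
tokenizer.save("my_tokenizer")  # save/load are inherited from Tokenizer

ids = tokenizer.encode("hello world <|endoftext|>")
print(tokenizer.decode(ids))

# Load the shipped pretrained vocabulary (wi17k_base); the path-based `load`
# call is an assumption about how pretrained/wi17k_base.json is consumed.
pretrained = BPETokenizer()
pretrained.load("pretrained/wi17k_base.json")
print(pretrained.encode("pretrained tokenizers save training time"))
```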

### Usage

File renamed without changes.
