update 1.0.3
Hk669 committed May 28, 2024
1 parent b491ec5 commit 781c43d
Showing 2 changed files with 34 additions and 3 deletions.
35 changes: 33 additions & 2 deletions README.md
@@ -1,6 +1,6 @@
# bpetokenizer

A Byte Pair Encoding (BPE) tokenizer, which algorithmically follows along the GPT tokenizer. The tokenizer is capable of handling special tokens and uses a customizable regex pattern for tokenization(includes the gpt4 regex pattern).
A Byte Pair Encoding (BPE) tokenizer that algorithmically follows the GPT tokenizer. The tokenizer handles special tokens, uses a customizable regex pattern for tokenization (including the GPT-4 regex pattern), and supports `save` and `load` of tokenizers in both `json` and `file` formats.


### Overview
@@ -76,12 +76,43 @@ print(ids)
decode_text = tokenizer.decode(ids)
print(decode_text)

tokenizer.save("sample_bpetokenizer")
tokenizer.save("sample_bpetokenizer", mode="json") # mode: default is file
```

Refer to [sample_bpetokenizer](sample/bpetokenizer) for an overview of the `vocab` and `model` files of the tokenizer trained on the texts above.
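
For reference, a minimal sketch of saving the same tokenizer in both formats. The json filename matches the load example below; the exact filenames written in `file` mode are an assumption based on the `vocab` and `model` files mentioned above:

```py
# a sketch of both save modes (file-mode output names are assumed, not confirmed here)
tokenizer.save("sample_bpetokenizer")               # default "file" mode: writes the model and vocab files
tokenizer.save("sample_bpetokenizer", mode="json")  # writes sample_bpetokenizer.json
```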


#### To Load the Tokenizer

```py
from bpetokenizer import BPETokenizer

tokenizer = BPETokenizer()

# load the vocab, merges, and special tokens saved earlier in json mode
tokenizer.load("sample_bpetokenizer.json", mode="json")

encode_text = """
<|startoftext|>Hello, World! This is a sample text with the special tokens [SPECIAL1] and [SPECIAL2] to test the tokenizer.
Hello, Universe! Another example sentence containing [SPECIAL1] and [SPECIAL2], used to ensure tokenizer's robustness.
Greetings, Earth! Here we have [SPECIAL1] appearing once again, followed by [SPECIAL2] in the same sentence.<|endoftext|>"""

print("vocab: ", tokenizer.vocab)
print('---')
print("merges: ", tokenizer.merges)
print('---')
print("special tokens: ", tokenizer.special_tokens)

ids = tokenizer.encode(encode_text, special_tokens="all")  # encode with all special tokens enabled
print('---')
print(ids)

decode_text = tokenizer.decode(ids)
print('---')
print(decode_text)

```
Refer to [load_json_vocab](sample/load_json_vocab/) and run `bpetokenizer_json` for an overview of the `vocab`, `merges`, and `special_tokens`.
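
If the tokenizer was saved in the default `file` mode instead, loading would look something like the sketch below; the `.model` extension and the default `mode` of `load` are assumptions, not confirmed by this README:

```py
from bpetokenizer import BPETokenizer

tokenizer = BPETokenizer()
# assumption: file-mode save produces a ".model" file and load defaults to mode="file"
tokenizer.load("sample_bpetokenizer.model")
```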

### Run Tests

The `tests/` folder contains the tests for the tokenizer; they use pytest.
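
For example, from the repository root (assuming pytest is installed):

```sh
pytest tests/
```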
2 changes: 1 addition & 1 deletion bpetokenizer/version.py
@@ -1 +1 @@
__version__ = "1.0.2"
__version__ = "1.0.3"
