[WIP] free speed/mem optimizations with ahash, dary_heap, and compact_str #1618

Open

mjbommar wants to merge 1 commit into main

Summary

Given that this library is largely an interface to hash maps of strings in Rust, we can get "free" 5-25% speedups by using stable, well-tested drop-in replacements like ahash::HashMap, dary_heap::NHeap, and CompactString.

The improvements span both training and subsequent encode/decode.
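
To make the pattern concrete, here is a minimal sketch of the kind of drop-in swap involved (assumed shapes, not this PR's exact diff; QuaternaryHeap is one of dary_heap's concrete aliases):

// Sketch only: std collections swapped for faster equivalents that keep
// the same API surface.
use ahash::HashMap;             // std HashMap with the aHash hasher
use compact_str::CompactString; // inlines short strings, avoiding heap allocs
use dary_heap::QuaternaryHeap;  // 4-ary heap; same API as std BinaryHeap

fn main() {
    // Token -> id map keyed by inline-able short strings.
    let mut vocab: HashMap<CompactString, u32> = HashMap::default();
    vocab.insert(CompactString::new("hello"), 0);
    assert_eq!(vocab.get("hello"), Some(&0)); // Borrow<str> lookups still work

    // Merge candidates ordered by count, as a BPE trainer might keep them.
    let mut queue: QuaternaryHeap<(u64, u32)> = QuaternaryHeap::new();
    queue.push((42, 0));
    assert_eq!(queue.pop(), Some((42, 0)));
}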

Notes

  • We tested a few other options, such as hashing the strings directly, using smol or a custom Huffman encoding for shorter strings, and using a BiHashMap from bimap; this combination performed best.
  • We have already tested the improvements with the rust library directly on very large training corpora and have seen no issues on linux/x86 and apple silicon.
  • Microbenchmarks from benches look good (shown below).
  • No regressions after testing with valgrind (although ~420K was already leaked by cargo bench on HEAD).
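
For reference, the valgrind check can be reproduced along these lines (an assumed invocation, not the exact one used; the benchmark binary's hash suffix will differ):

$ cargo bench --bench bpe_benchmark --no-run
$ valgrind --leak-check=full target/release/deps/bpe_benchmark-<hash> --bench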

Issue

Because of the way the interface is organized across the core Rust library and the Python/Node bindings, there isn't an easy way to merge this with support for encode/decode.

For example, because Model is defined on the Rust side and the Vocab traits are used differently across models, we'd have to use pyo3 within the Rust library for FromPyObject.
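
As a hypothetical illustration of the problem (not code from this PR, and assuming pyo3 >= 0.21's extract_bound), changing the vocab's key and hasher types means the bindings can no longer rely on pyo3's built-in conversions and would need a manual FromPyObject impl in the core library:

use std::collections::HashMap;
use ahash::RandomState;
use compact_str::CompactString;
use pyo3::prelude::*;

// Hypothetical fast-vocab newtype; the name and shape are illustrative only.
struct FastVocab(HashMap<CompactString, u32, RandomState>);

impl<'py> FromPyObject<'py> for FastVocab {
    fn extract_bound(ob: &Bound<'py, PyAny>) -> PyResult<Self> {
        // pyo3 only knows std types out of the box, so extract into a
        // std HashMap first, then convert into the fast containers.
        let std_map: HashMap<String, u32> = ob.extract()?;
        Ok(FastVocab(
            std_map
                .into_iter()
                .map(|(k, v)| (CompactString::from(k), v))
                .collect(),
        ))
    }
}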

In theory, we could implement these changes only within the trainer, but the real user-facing/environmental impact would come from implementing them in the encode/decode bindings, where most usage probably occurs.

Choices

Assuming you want to merge something like this, I think we have a few choices:

  1. Refactor the way that Vocab is managed within the Rust library and where the traits are implemented.
  2. Only implement the improvements within Trainers (a sketch of this approach follows).
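
If choice 2 is taken, a minimal sketch (assumed type shapes, not the crate's actual internals) would keep the fast containers inside the trainer and convert back to the existing public Vocab type once at the end of training:

use std::collections::HashMap;
use ahash::RandomState;
use compact_str::CompactString;

// Public vocab type as exposed today across Model and the bindings.
type Vocab = HashMap<String, u32>;
// Trainer-internal map using the faster hasher and inline strings.
type FastVocab = HashMap<CompactString, u32, RandomState>;

// Convert once at the trainer boundary: lookups during training stay fast,
// and the Model/bindings interfaces remain unchanged.
fn finalize_vocab(internal: FastVocab) -> Vocab {
    internal
        .into_iter()
        .map(|(token, id)| (token.as_str().to_owned(), id))
        .collect()
}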

Example Benchmark (i7-12700K)

NB: We replaced data/big.txt with a much larger text corpus (271 MB vs 6.2 MB), but results were comparable for the original data/big.txt.

$ git checkout HEAD~1
$ cargo build && /usr/bin/time -v cargo bench --bench bpe_benchmark
$ git checkout consolidated-optimization-ahash-dary-compact-str
$ cargo build && /usr/bin/time -v cargo bench --bench bpe_benchmark

Results:

Before

BPE GPT2 encode         time:   [10.512 µs 10.527 µs 10.544 µs]
                        change: [-0.3616% +0.8589% +1.7420%] (p = 0.13 > 0.05)
                        No change in performance detected.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild

BPE GPT2 encode batch   time:   [4.6695 ms 4.6921 ms 4.7171 ms]
                        change: [+11.476% +12.090% +12.746%] (p = 0.00 < 0.05)
                        Performance has regressed.

BPE GPT2 encode, no cache
                        time:   [18.042 µs 18.094 µs 18.163 µs]
                        change: [+9.9213% +11.676% +13.421%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high mild

BPE GPT2 encode batch, no cache
                        time:   [5.0045 ms 5.0233 ms 5.0421 ms]
                        change: [+9.4710% +10.196% +10.992%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 20 measurements (10.00%)
  1 (5.00%) low mild
  1 (5.00%) high mild

BPE Train vocabulary (small)
                        time:   [15.129 ms 15.234 ms 15.415 ms]
                        change: [+4.1576% +5.2339% +6.3317%] (p = 0.00 < 0.05)
                        Performance has regressed.

Benchmarking BPE Train vocabulary (big): Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 22.4s.
BPE Train vocabulary (big)
                        time:   [2.2645 s 2.2734 s 2.2828 s]
                        change: [+6.4027% +6.9885% +7.6193%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

After

BPE GPT2 encode         time:   [10.600 µs 10.634 µs 10.673 µs]
                        change: [+0.2136% +1.1903% +2.4177%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 20 measurements (15.00%)
  1 (5.00%) high mild
  2 (10.00%) high severe

BPE GPT2 encode batch   time:   [4.4834 ms 4.5053 ms 4.5327 ms]
                        change: [-4.6968% -4.1793% -3.6501%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high mild

BPE GPT2 encode, no cache
                        time:   [15.949 µs 16.350 µs 16.889 µs]
                        change: [-9.8540% -7.9585% -6.2720%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 20 measurements (35.00%)
  4 (20.00%) low severe
  3 (15.00%) high mild

BPE GPT2 encode batch, no cache
                        time:   [4.5396 ms 4.5578 ms 4.5762 ms]
                        change: [-9.9078% -9.1506% -8.2170%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high severe

BPE Train vocabulary (small)
                        time:   [14.401 ms 14.483 ms 14.599 ms]
                        change: [-6.5321% -5.6261% -4.6370%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

Benchmarking BPE Train vocabulary (big): Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 21.1s.
BPE Train vocabulary (big)
                        time:   [2.1073 s 2.1099 s 2.1129 s]
                        change: [-7.5971% -7.1924% -6.8103%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

time -v comparisons (new vs old):

  • System time (s): 72.6 vs 76.5
  • Maximum resident set size (kbytes): 913772 vs 915372
