Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

experiment: switch from zlib to Zstd for index file #1438

Draft
wants to merge 3 commits into
base: staged
Choose a base branch
from

Conversation

shenlebantongying
Copy link
Collaborator

Most chunks in the index file are small, but some formats appear to be slightly large, like mdx, sometimes up to 64Kib for one chunk. The total index file size is usually less than 1 Mib, but sometimes it can go up to 10~15Mib.

Better compression library may lead to obvious improvement in indexing time.

Zstd on its website claims that it is better than zlib in all aspects. Various benchmarks online confirm this.


I added a simple time measurement to mdx's index file creation to compare zstd & zlib.

Run GD with a single large MDX dict, then grep ms from stdout.

Measured with release build, MacBook Air (M1, 2020), a ~140 mb mdx dictionary.

Both generated index files are ~3.5 mb, the diff is less than 0.2 mb.

Indexing time can be reduced by >10% (note that the measurement time also includes unrelated things like file writing.)

Screenshot 2024-03-24 at 4 45 50 PM

Spreadsheet file used (To download -> top left -> file -> download) https://docs.google.com/spreadsheets/d/1In6Qvpp3M1GmWPN4L6AdLkKnLFnF9MUUwWedDdVG0Ms/edit?usp=sharing


An alternative is lz4 which is significantly faster on decompression/compression, but the compression ratio is lower. The cost of having large files, like for slow disks (The benefit of faster compression have to outweigh the cost of larger file writing, I don't know.).

The default compression level of Zstd is 3.

The default compression level of zlib is 6.

Since in our use case, the size is small, some adjustment may yield a better result.


if ( compress( &bufferCompressed.front(), &compressedSize, &buffer.front(), bufferUsed ) != Z_OK )
const size_t size_or_err =
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API design is different.

In zlib,

compress will write the size written to its 2nd paramater

In Zstd,

compress will return the size written or error code. facebook/zstd#1825 (comment)

@shenlebantongying shenlebantongying changed the title feat: uses Zstd for index file instead of Zlib (and index building benchmarks) feat: uses Zstd for index file instead of zlib for faster indexing Mar 24, 2024
@shenlebantongying
Copy link
Collaborator Author

I run the same benchmark on my Linux box,

this dict: https://jitendex.org/pages/downloads.html

in debug build

The speedup is around 8%

image

@shenlebantongying shenlebantongying changed the title feat: uses Zstd for index file instead of zlib for faster indexing feat: switch from zlib to Zstd for index file to boost indexing time by >10% Mar 24, 2024
@xiaoyifang
Copy link
Owner

xiaoyifang commented Mar 25, 2024

The speedup is around 8%

I think 8% is not worth the trouble.

I would like to replace the entire index file with leveldb or rocksdb or even xapian which will give a boost in the headword browse requirement.

The current index structure does not perform well when browse all the headwords when the dictionary has a very large amount of headwords.

@shenlebantongying
Copy link
Collaborator Author

I think 8% is not worth the trouble.

But it is a consistent improvement for the moment.

I would like to replace the entire index file with leveldb or rocksdb or even xapian which will give a boost in the headword browse requirement.

Ok, we will get there. I find the main of the challenge is dealing with the existing code rather than writing the new one 😅. Need lots of time

@shenlebantongying shenlebantongying changed the title feat: switch from zlib to Zstd for index file to boost indexing time by >10% experiment: switch from zlib to Zstd for index file Mar 25, 2024
@xiaoyifang
Copy link
Owner

But it is a consistent improvement for the moment.

I think the main concern is that using the new compression method will make users to reindex all the dictionaries.

@shenlebantongying
Copy link
Collaborator Author

Yes, but it is a one-time cost.

(However, it is not one-time cost for someone who switching between the original version and this.)

@xiaoyifang
Copy link
Owner

xiaoyifang commented Mar 25, 2024

It is also cause compatbile issue between our own releases.

compression time & uncompression time should be both considered.

Maybe we can start a beta version to try all the incompatible changes. such as unify dictionaryId generation logic between portable and normal version .

@shenlebantongying
Copy link
Collaborator Author

shenlebantongying commented Mar 25, 2024

I am unsure how to proceed. I believe most users of this problem are not really technical, breakages are devastating for them.

Maybe we should label issues that will need a breakage to know the scope of the problem?

beta version

I think we should call it “optimized version” to give a reason for migration. In the release page, we say it includes optimizations that aren't possible to keep compatibility with the original GD and previous GD-ng versions. A little psychological trick 😅

@xiaoyifang
Copy link
Owner

I am unsure how to proceed

create a beta branch ,enable this branch auto build when pushed changes. and make an Attention in release note about the incompatible issue.

the beta version can be co-exist with the alpha version .

@shenlebantongying
Copy link
Collaborator Author

Maybe we should accumulate features (both planed & implemented) before publishing it, to avoid the cost of switching back and forth.

We can also just reuse the main branch. Just add lots of cumbersome #if FEATURE_XXX_ENABLED and add a compile option ENABLE_BREAHKING_CHANGES. One workflow can build and publish both versions.

A new page in doc is needed like: optimized version changes (rationals, issuses...)

@shenlebantongying shenlebantongying added the vNext Improvments and optimizations that need incompatible changes. label Mar 25, 2024
@xiaoyifang
Copy link
Owner

We can also just reuse the main branch. Just add lots of cumbersome #if FEATURE_XXX_ENABLED and add a compile option ENABLE_BREAHKING_CHANGES. One workflow can build and publish both versions.

the code will become too complex in the future.

@shenlebantongying
Copy link
Collaborator Author

shenlebantongying commented Mar 25, 2024

TBH, I don't have lots of spare time anymore. I prefer to work on gradually replacing the current index implementation, or at least make it simple to replace. 😅

Maybe at some point we can declare that the main branch is in maintenance mode and only get critical bug fixes only. All new code enters the beta branch as you said.

@xiaoyifang
Copy link
Owner

xiaoyifang commented Mar 26, 2024

compression speed is not the only thing to consider, the time to uncompres ,the disk consumption etc should also be considered.

I guess if no compression method is used ,it should be more faster.

A more elegant way should be consider all the followings,such as

The index structure,
compression algorithm | compressed size|
compatibility
etc.

@shenlebantongying shenlebantongying marked this pull request as draft April 5, 2024 13:26
Copy link

sonarcloud bot commented Apr 5, 2024

Quality Gate Failed Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 80%)

See analysis details on SonarCloud

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
vNext Improvments and optimizations that need incompatible changes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants