experiment: switch from zlib to Zstd for index file #1438

shenlebantongying · 2024-03-24T21:19:30Z

Most chunks in the index file are small, but some formats, like mdx, produce larger ones, sometimes up to 64 KiB per chunk. The total index file size is usually less than 1 MiB, but it can sometimes reach 10~15 MiB.

A better compression library may noticeably improve indexing time.

Zstd on its website claims that it is better than zlib in all aspects. Various benchmarks online confirm this.

I added a simple time measurement to mdx's index file creation to compare zstd & zlib.

Run GD with a single large MDX dict, then grep ms from stdout.
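
The measurement itself can be as simple as the sketch below. It is not the exact code from the TEMP commit; `measureMs` and `reportMs` are hypothetical helpers that time a callable with `std::chrono::steady_clock` and print a line ending in `ms` so it can be grepped.

```cpp
#include <chrono>
#include <cstdio>
#include <utility>

// Hypothetical helper: runs `work` once and returns the elapsed wall time in milliseconds.
template< typename Work >
long long measureMs( Work && work )
{
  const auto start = std::chrono::steady_clock::now();
  std::forward< Work >( work )();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration_cast< std::chrono::milliseconds >( stop - start ).count();
}

// Hypothetical helper: times `work` and prints "<label>: <N> ms" to stdout.
template< typename Work >
long long reportMs( const char * label, Work && work )
{
  const long long ms = measureMs( std::forward< Work >( work ) );
  std::printf( "%s: %lld ms\n", label, ms );
  return ms;
}
```

Wrapping the whole index build in one `reportMs` call means the number also covers file writing, which is the caveat mentioned for the results.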

Measured with a release build on a MacBook Air (M1, 2020), using a ~140 MB mdx dictionary.

Both generated index files are ~3.5 MB; they differ by less than 0.2 MB.

Indexing time can be reduced by >10% (note that the measured time also includes unrelated work such as file writing).

Spreadsheet file used (To download -> top left -> file -> download) https://docs.google.com/spreadsheets/d/1In6Qvpp3M1GmWPN4L6AdLkKnLFnF9MUUwWedDdVG0Ms/edit?usp=sharing

An alternative is lz4, which is significantly faster at both compression and decompression but has a lower compression ratio. Larger files have their own cost, for example on slow disks: the benefit of faster compression has to outweigh the cost of writing a larger file, and I don't know whether it does.

The default compression level of Zstd is 3.

The default compression level of zlib is 6.

Since the chunks in our use case are small, adjusting the compression level may yield a better result.

shenlebantongying · 2024-03-24T21:22:30Z

src/chunkedstorage.cc


-  if ( compress( &bufferCompressed.front(), &compressedSize, &buffer.front(), bufferUsed ) != Z_OK )
+  const size_t size_or_err =


The API design is different.

In zlib,

compress writes the compressed size to its 2nd parameter.

In Zstd,

compress returns either the size written or an error code. facebook/zstd#1825 (comment)

shenlebantongying · 2024-03-24T22:36:52Z

I ran the same benchmark on my Linux box,

with this dict: https://jitendex.org/pages/downloads.html

in a debug build.

The speedup is around 8%.

[autofix.ci] apply automated fixes a

xiaoyifang · 2024-03-25T00:47:50Z

> The speedup is around 8%

I think 8% is not worth the trouble.

I would like to replace the entire index file with leveldb, rocksdb, or even xapian, which would speed up headword browsing.

The current index structure does not perform well when browsing all the headwords of a dictionary that has a very large number of them.

shenlebantongying · 2024-03-25T05:55:31Z

> I think 8% is not worth the trouble.

But it is a consistent improvement for the moment.

> I would like to replace the entire index file with leveldb, rocksdb, or even xapian, which would speed up headword browsing.

Ok, we will get there. I find that the main challenge is dealing with the existing code rather than writing new code 😅. It needs lots of time.

xiaoyifang · 2024-03-25T06:55:39Z

> But it is a consistent improvement for the moment.

I think the main concern is that using the new compression method will force users to reindex all their dictionaries.

shenlebantongying · 2024-03-25T07:27:50Z

Yes, but it is a one-time cost.

(However, it is not a one-time cost for someone who switches between the original version and this one.)

xiaoyifang · 2024-03-25T08:41:06Z

It also causes compatibility issues between our own releases.

Compression time and decompression time should both be considered.

Maybe we can start a beta version to try all the incompatible changes, such as unifying the dictionaryId generation logic between the portable and normal versions.

shenlebantongying · 2024-03-25T08:59:52Z

I am unsure how to proceed. I believe most users of this program are not really technical; breakages are devastating for them.

Maybe we should label the issues that would need a breaking change, to know the scope of the problem?

> beta version

I think we should call it “optimized version” to give a reason for migration. On the release page, we say it includes optimizations that can't be done while keeping compatibility with the original GD and previous GD-ng versions. A little psychological trick 😅

xiaoyifang · 2024-03-25T09:07:39Z

> I am unsure how to proceed

Create a beta branch, enable automatic builds for it when changes are pushed, and add an Attention note to the release notes about the incompatibility.

The beta version can coexist with the alpha version.

shenlebantongying · 2024-03-25T09:38:34Z

Maybe we should accumulate features (both planned & implemented) before publishing it, to avoid the cost of switching back and forth.

We can also just reuse the main branch. Just add lots of cumbersome #if FEATURE_XXX_ENABLED and add a compile option ENABLE_BREAKING_CHANGES. One workflow can build and publish both versions.
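
Roughly, the gating could look like the sketch below. The macro `FEATURE_ZSTD_INDEX_ENABLED` and the function `indexCodecName` are invented for illustration; only the idea of one compile option switching the incompatible features on comes from the proposal itself.

```cpp
#include <string>

// Assumed build option; a build would pass e.g. -DENABLE_BREAKING_CHANGES=1.
#ifndef ENABLE_BREAKING_CHANGES
  #define ENABLE_BREAKING_CHANGES 0
#endif

// Hypothetical per-feature switch derived from the single build option.
#if ENABLE_BREAKING_CHANGES
  #define FEATURE_ZSTD_INDEX_ENABLED 1
#else
  #define FEATURE_ZSTD_INDEX_ENABLED 0
#endif

// Hypothetical function: names the codec this build would use for index chunks.
inline std::string indexCodecName()
{
#if FEATURE_ZSTD_INDEX_ENABLED
  return "zstd";
#else
  return "zlib";
#endif
}
```

Every incompatible code path would need a pair of branches like this, which is where the "cumbersome" part comes from.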

A new page in the docs is needed, like: optimized version changes (rationales, issues...)

xiaoyifang · 2024-03-25T09:49:26Z

> We can also just reuse the main branch. Just add lots of cumbersome #if FEATURE_XXX_ENABLED and add a compile option ENABLE_BREAKING_CHANGES. One workflow can build and publish both versions.

The code will become too complex in the future.

shenlebantongying · 2024-03-25T10:01:39Z

TBH, I don't have lots of spare time anymore. I prefer to work on gradually replacing the current index implementation, or at least making it simple to replace. 😅

Maybe at some point we can declare that the main branch is in maintenance mode and only gets critical bug fixes. All new code goes into the beta branch, as you said.

xiaoyifang · 2024-03-26T00:33:46Z

Compression speed is not the only thing to consider; decompression time, disk consumption, etc. should also be considered.

I guess that if no compression is used at all, it would be even faster.

A more elegant approach should consider all of the following:

The index structure
The compression algorithm and the compressed size
Compatibility
etc.

sonarcloud · 2024-04-05T13:40:20Z

Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 80%)

See analysis details on SonarCloud

TEMP: add basic time measure to mdx

2cea76f

shenlebantongying force-pushed the feat/zstd-index-file branch from 655ecf9 to 7a5a5d8 Compare March 24, 2024 21:31

shenlebantongying changed the title ~~feat: uses Zstd for index file instead of Zlib (and index building benchmarks)~~ feat: uses Zstd for index file instead of zlib for faster indexing Mar 24, 2024

shenlebantongying changed the title ~~feat: uses Zstd for index file instead of zlib for faster indexing~~ feat: switch from zlib to Zstd for index file to boost indexing time by >10% Mar 24, 2024

feat: use Zstd for index file compression instead of zlib

e0cb233

shenlebantongying force-pushed the feat/zstd-index-file branch from ab8a671 to e0cb233 Compare March 24, 2024 23:14

[autofix.ci] apply automated fixes

7d22d22

shenlebantongying closed this Mar 25, 2024

shenlebantongying changed the title ~~feat: switch from zlib to Zstd for index file to boost indexing time by >10%~~ experiment: switch from zlib to Zstd for index file Mar 25, 2024

shenlebantongying added the vNext Improvments and optimizations that need incompatible changes. label Mar 25, 2024

shenlebantongying reopened this Apr 5, 2024

shenlebantongying marked this pull request as draft April 5, 2024 13:26
