Releases: ilius/pyglossary

PyGlossary 4.7.1

16 Sep 17:28

Changes since 4.7.0

Breaking changes:

4c78aa4 replace CC-CEDICT plugin with EDICT2 plugin

Bug fixes and improvements:

f5a420c Bugfix: Glossary: removeHtmlTagsAll was ineffective with --sort; same for preventDuplicateWords
01b5606 Yomichan: merge entries with same headword, #574
5fe93f4 Yomichan: add beautifulsoup4 to dependencies, #577
2a23966 use python3 in scripts/view-glossary and scripts/diff-glossary to bypass pyenv
c878cbd zimfile: replace OSError on Windows with a warning, #580
1573d5c Wiktextract: rewrite writeSenseExample and fix #572 - Fix TypeError: got invalid input value of type <class 'list'> - Create a list of examples - Add the example type as prefix in bold
7f64af5 Wiktextract: keep warnings in a Counter, remove duplicate messages and show at end

New Features

aa6765b add new plugin xdxf_css (XdxfCss) based on PR #570 by @soshial
0e9d221 add read_options to .info file
fea2223 StarDict Textual writer: save resource files in res/ folder, #558
3800fac add Dyula language, #575
08c41da add glos.readOptions property

Refactoring, linting and testing

6786880 fix ruff preview error in appledict_bin/__init__.py
fd09e16 github actions: switch to ruff 0.5.2
019740e fix ruff error
69bcbf9 fix ruff preview error: B909 Mutation to loop iterable during iteration
5596b7f switch to ruff 0.6.4
03a509b fix ruff preview errors, use str.removesuffix
6ca9902 fix some mypy errors
eac286b github test: use lxml==5.2 to fix jmdict test
f2eb39d move info writer out of plugins
578c854 fix tests: test_save_info_json
0f4d885 update pyproject.toml
1e20a1a format pyglossary/glossary_v2.py
e231b64 update scripts/format-code
4aa4f09 github action test: remove test cache
acdbede github test: upload failed test files
1f095ad fix test action
9df1ed6 update jmdict test and switch to lxml==5.3

Full Changelog: 4.7.0...4.7.1

PyGlossary 4.7.0

16 Jun 17:32
b41161d

Full Changelog: 4.6.1...4.7.0

PyGlossary 4.6.1

10 Mar 12:48
5f6ebd6

Changes since 4.6.0

Bug fixes

  • Fix a bug causing broken installation if ~/.local/lib is a symbolic link

    • or site-packages or any of its parents are a symbolic link
  • Fix incompatibility with Python 3.9 (despite documentation)

  • Fix scripts/entry-filters-doc.py, scripts/plugin-doc.py and doc/entry-filters.md

  • AppleDict: Fix typos in Chinese language module

Features:

  • Use environment variable VERBOSITY as default (a number from 0 to 5)
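
As a rough illustration of the behavior described above, a default verbosity could be resolved from the environment as follows. The function name `default_verbosity` and the fallback value of 3 are assumptions made for this sketch, not PyGlossary's actual code.

```python
import os


def default_verbosity(environ=os.environ, fallback: int = 3) -> int:
    """Return VERBOSITY if it is an integer from 0 to 5, otherwise the fallback."""
    raw = environ.get("VERBOSITY", "")
    try:
        level = int(raw)
    except ValueError:
        # unset or non-numeric value: ignore it
        return fallback
    if 0 <= level <= 5:
        return level
    return fallback
```

A command-line flag would still take precedence; the environment variable only supplies the default.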

Improvements

  • AppleDict Binary: set html_full=True by default

  • Update wcwidth to 0.2.6

Refactoring

  • Add glos.stripFullHtml(errorHandler) and use it in 3 plugins

    • Add entry filter StripFullHtml and change entry.stripFullHtml() to return error
  • Refactor entryFiltersRules

  • Remove empty plugin gettext_mo.py

  • Remove glos.titleElement from glossary_v2.Glossary

    • Add to glossary.Glossary for compatibility
    • glossary.Glossary is a wrapper (child class) on top of glossary_v2.Glossary

Documentation

  • Update doc/entry-filters.md to list some entry filters that were enabled conditionally (besides config)

  • Remove sdict.md and sdict_source.md (removed plugins)

Type checking

  • Add missing method in GlossaryType class
  • Fix mypy errors on most of code base and some of plugins
  • Use builtin types list, dict, tuple, set for type annotations
  • Replace Optional[X] with X or None
    • will not affect runtime, but type checking now only works with Python 3.10+

PyGlossary 4.6.0

07 Mar 11:27
4b7ae78

Changes since 4.5.0

Dependency change

We now require Python 3.9 or a later version.

Bug fixes

  • Fix exception in scripts/plugin-index.py: 8a94b8c

  • StarDict: Fix writing to .zip file producing an empty zip, and fix bad test

  • dictunformat: fix #367: add option headword_separator, default to ;

  • Fixes in ui_gtk, #380 #382 #403

  • AppleDict source: fix #407 missing quotes for title, and refactor duplicate code

  • DictionaryForMIDs: remove | from word when normalizing, fix punctuation regex, use Unix newlines

  • StarDict: use Unix newline when reading and writing .ifo file on Windows

  • Fix bug of glos.addEntryObj(dataEntry) adding empty file because tmpDataDir is not set until glos.read()

    • Set and create tmpDataDir on glos.tmpDataDir access, and add test, #424
  • Fix scripts/wiki-formats.py, #428

  • Dictd / Dict.org: fix exception on Windows

Features

  • Support sorting by an ICU locale, see Sorting section of README

  • Add Gtk4 interface --ui=gtk4 / --gtk4

    • still buggy and not as functional as Gtk3 or Tkinter interfaces
  • Add flag --optimize-memory, config key optimize_memory

    • To enable entry compression on --indirect
    • Not enabled by default (it was previously always compressed)
  • Allow plugin's reader.open() to return an Iterator for progress bar

    • Implement for Tabfile (reading info/metadata)
    • Implement for AppleDict Binary (reading KeyText.data)
  • Add read and write support for StarDict Textual File (.xml), #348

  • Add support for writing Yomichan dictionary files, #395 by @tomtung

  • StarDict reader: support .syn.dz file, #410

  • StarDict writer: add write option large_file, #392 #422

  • StarDict reader: support dxoffsetbits=64 on read, #392 #422

  • JMDict: support examples, #383

  • Add read support for JMnedict, #386

  • Add flag --skip-duplicate-headword, config skip_duplicate_headword, #365

    • Zim reader: remove option skip_duplicate_words, #365
  • Add flag --trim-arabic-diacritics, config trim_arabic_diacritics, #366

  • Add read support for IUPAC goldbook (.xml), #355

  • Add write support for DIKT JSON

  • StarDict writer: limit memory usage by using SQLite for idx and syn data, #409

  • CSV: add newline option, defaulting to Unix-style

  • Aard2 Slob writer: add option file_size_approx_check_num_entries

  • Add scripts/diff-glossary and scripts/view-glossary
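
The item above about a plugin's reader.open() returning an Iterator for the progress bar can be pictured with a generator-based open(). This is only a sketch: `SketchReader`, its `lines` argument and the `(done, total)` tuple shape are assumptions for illustration, not the plugin API itself.

```python
from typing import Iterator


class SketchReader:
    """Hypothetical reader whose open() is a generator that reports progress."""

    def __init__(self) -> None:
        self.info: dict[str, str] = {}

    def open(self, lines: list[str]) -> Iterator[tuple[int, int]]:
        total = len(lines)
        for index, line in enumerate(lines, start=1):
            key, _, value = line.partition("\t")
            self.info[key] = value
            # yield (done, total) so the caller can redraw a progress bar
            yield index, total


reader = SketchReader()
for done, total in reader.open(["name\tTest", "author\tMe"]):
    print(f"loading info: {done}/{total}")
```

Because open() is a generator here, the caller controls the pace: it can update the user interface between steps instead of blocking until the whole header is loaded.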

Improvements

  • When removing HTML tags, also replace <div> with \n, #394 by @tomtung

    • Treat <div> the same way <p> is treated.
  • Mobi: add mobi7-forcing switch to kindlegen command, #374 by @holyspiritomb

  • Octopus MDict: ignore directories with same_dir_data_files, #362

  • StarDict reader: handle definitions with mixed types/formats

  • Dictfile: strip whitespaces from word and defi before going through entry filters

  • BGL: strip whitespaces from word and defi before going through entry filters

  • Improvement in glos.write: avoid printing exception for invalid encoding

  • Remove empty logs in glos.convert

  • StarDict reader: fix validating sametypesequence, and add test

  • glos.convert: Allow an existing empty directory as output path

  • TextGlossaryReader: replace nextPair method with nextBlock which returns resource files as third item

  • ui_cmd_interactive: allow converting several times before exiting

  • Change title tag for Greek from <big> to <b>

  • Update language data set (langs.json)

  • ui/main.py: print 1-line error instead of full exception on ImportError

  • ui/main.py: Windows: try Tkinter before Gtk

  • ebook_base.py: avoid shutil.move on Windows, #368

  • TextGlossaryReader: fix loading info and some refactoring, #370 36b9cd8

  • Entry: Allow word to be tuple in Entry(word=...)

  • glos.iterInfo() returns an Iterator rather than an Iterable

  • Zim: change dependency to libzim>=1.0, and some comments

  • Mobi: work with kindlegen executable in PATH directories, #401

  • ui: limit the length of option comments in Format Options dialog

  • ui_gtk: improvement: show (last) critical error on status bar

  • ui_gtk: set initial focus

  • ui_gtk: improvements in About tab

  • ui_tk: revert most ttk widgets to tk because the theme doesn't match

  • Add SVG icon, #414 by @proletarius101

  • Prevent exception/traceback on Ctrl+C

  • Optimize progress bar

  • Aard2 slob: show info log before and after slobWriter.finalize(), #437

Removed features

  • Remove read support for Wiktionary Dump, #48

  • Remove support for Sdictionary Binary and Source

Octopus MDict MDX: features and improvements

  • Support MDict V3 format by updating readmdict, #385 by @xiaoqiangwang

  • Fix files created without UUID in header, #387 by @xiaoqiangwang

    • MdxBuilder 4.0 RC2 and before creates files without UUID header
  • Decode mdict title & description if they're bytes, #393 by @tomtung

  • readmdict: Skip zlib decompress exceptions, #384

  • readmdict: Use __name__ as logger name, and add 2 debug logs, #384

  • readmdict: improve exception msg for xxhash, #385

XDXF: fixes / improvements, issue #376

  • Support <categ>
  • Support embedded tags in <iref>
  • Fix ignoring <mrkd>
  • Fix extra newlines
  • Get rid of warning for <etm>
  • Fix/improve newline and space issues
  • Fix and improve tests
  • Update url for format description
  • Support any tag/string in <ex>, #396
  • Support reading compressed files directly (.xdxf.gz, .xdxf.bz2, .xdxf.lzma)
  • Allow using XSL with --write-options=xsl=True
  • Update XSL
  • Other improvements in XDXF to HTML transformation

AppleDict Binary: features, bug fixes, improvements, refactoring

  • Fix css name on html_full=True

  • Fix using self._encoding when should use utf-8

  • Fix internal links, #343

    • Remove x-dictionary:d: prefix from href
    • First fix for x-dictionary:r:: use title if present
    • Add bword:// prefix to href (unless it points to http/https)
    • Read entry IDs on open and fix links with x-dictionary:r:
  • Add plistlib to dependencies

  • Add tests

  • Replace <entry ...> with <div>

  • Fix bad exception formatting

  • Fixes from PR #436

  • Support morphology (alternates): #434 by @soshial

  • Support different AppleDict offsets, #417 by @soshial

  • Extract AppleDict meta-info (langs, title, author), #418 by @soshial

  • Progress Bar on open() / loading KeyText.data

  • Improve memory usage of loading KeyText.data

  • Replace appledict_bin.py with appledict_bin directory and more refactoring

Glossary class (glossary.py)

  • Lots of refactoring in glossary.py

    • Improve the design and readability
    • Reduce complexity of methods
    • Move some code into new classes that Glossary inherits from
    • Improve error messages
  • Introduce glossary_v2.py, and maintain API backward-compatibility for glossary.py (as far as documented)

Refactoring

  • Fix style errors using ruff based on pyproject.toml configuration

  • Remove all usages of pyglossary.plugins.formats_common

  • Use str.startswith(tuple) and str.endswith(tuple)

  • Reduce complexity of Glossary methods

  • Rename entry filter strip to trim_whitespaces

  • Some refactoring in StarDict reader

  • Use f-string equal syntax added in Python 3.8

  • Use str.removeprefix and str.removesuffix added in Python 3.9

  • langs/writing_system.py:

    • Change iso field to list
    • Add new scripts
    • Add getAllWritingSystemsFromText
    • More refactoring
  • Split up TextGlossaryReader.loadInfo method

  • plugin_manager.py: make some methods private
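
For readers unfamiliar with the string idioms named in this list, here is a short standalone illustration using only the Python standard library. The file name and link are made-up sample values.

```python
name = "dictionary.txt.gz"

# str.endswith accepts a tuple of suffixes; any match returns True
is_compressed = name.endswith((".gz", ".bz2", ".lzma"))

# str.removesuffix / str.removeprefix (Python 3.9+) strip only an exact match
stem = name.removesuffix(".gz")
word = "bword://apple".removeprefix("bword://")

# the f-string "=" specifier (Python 3.8+) renders the expression and its repr
message = f"{stem=} {is_compressed=}"
```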

Documentation

  • Update plugins' documentation

  • Glossary: add comments about entryFilters

  • Update config.rst

  • Update doc/entry-filters.md

  • Update README.md

  • Update doc/sort-key.md

  • Update doc/pyicu.md

  • Update plugins/testformat.py

  • Add types for arguments and result of all functions/methods

  • Add types for r/w options in reader/writer classes

  • Fix a few incorrect type annotations

  • README.md: Add document for adding data entries, #412

  • README.md: Fix -> nixos command, #400 by @srghma

  • Update bgl_info.md and move it from pyglossary/plugins/babylon_bgl/ to doc/babylon/

Testing

  • Add test for DSL -> Tabfile conversion

  • dsl_test.py: fix method names not starting with test_

  • StarDict reader: better testing for handling definitions with mixed types

  • StarDict writer: much better testing, coverage of stardict.py: from 62% to 83%

  • Refactoring and improvements in tests of Glossary, along with new tests

  • Add test for dictunformat -> Tabfile

  • AppleDict (source) tests: validate plist file contents

  • Allow forking and branching pyglossary-test repo

  • Fix some failing tests on Windows

  • Slob: test file_size_approx

  • Test Tabfile -> SQL conversion

  • Test StarDict error/warning for sortKeyName with and without locale

  • Print useful messages for unhandled warnings

  • Improve logs

  • Add showDiff=False arg to compareTextFiles and convert

Packaging

  • Update and refactor Dockerfile and run-with-docker.sh

    • Dockerfile: chan...

PyGlossary 4.5.0

04 Feb 23:19
2433ff5

Changes since 4.4.1

Bug fixes

  • Fix 2 log messages in glos._resolveConvertSortParams

  • Fixes and improvements in Dictfile (.df) reader

    • Fix exception: disable loading info (Dictfile does not support info)
    • TextGlossaryReader: prevent producing duplicate data entries
      • This fixes: error in DataEntry.save: [Errno 2] No such file or directory: ... because entry.save() moves the temp file to output path
      • This bug only existed for Dictfile (.df) format.
    • Remove extra colon, #358
    • Remove some extra newlines
    • And add test for Dictfile to/from Tabfile
  • Fix not cleaning up temp directory on return with error from glos.convert

Features

  • ui_gtk: add a "General Options" button that opens a dialog for:

    • Settings for sort and sortKey
    • Checkbox for SQLite mode
    • Check boxes for config params: save_info_json, lower, skip_resources, rtl, enable_alts, cleanup, remove_html_all
  • Add support for --sort-key random to shuffle entries

Performance improvements

  • Performance improvement: remove gc.collect() calls in Glossary and *EntryList

    • Not needed since Python 3.8
    • Change minimum python requirement to 3.8 in README.md
  • Do not import all plugin modules (only import two plugins that are used)

    • Load json file plugins-meta/index.json instead
    • In debug mode, all plugin modules are still imported and validated
    • User plugins are still imported

Other improvements

  • Improve detection of languages from glossary name, and add tests
  • Update langs.json: add new 3-letter codes for 25 languages
  • glos.preventDuplicateWords and glos.removeHtmlTagsAll: prevent adding filter twice
  • glos.cleanup: reset path list to avoid (non-critical) error if called again
  • Minor improvements in Glossary.__init__()
  • DataEntry.save: on FileNotFoundError show a 1-line error instead of log.exception
  • ui_gtk: create a new Glossary object every time Convert button is clicked
  • Add docstring for Glossary.__init__

Unit testing

  • Update tests/glossary_errors_test.py
  • Add missing cleanup for some temp files
  • add test for LDF to/from Tabfile

Refactoring

  • Plugins: replace import of formats_common from current directory with pyglossary.plugins.formats_common

  • Fix logging.warn method is deprecated, use warning instead, PR #360 by @BoboTiG

  • Fix DeprecationWarning: invalid escape sequence, PR #361 by @BoboTiG

  • Move some functions from glossary_utils.py to compression.py

  • Move some methods from Glossary to new parent classes PluginManager and GlossaryInfo

  • Some refactoring in plugin_prop.py and plugin_manager.py

    • Rename plugin.pluginModule to plugin.module
    • Minimize direct access to plugin.module, plugin.readerClass or plugin.writerClass
    • Add some new properties to PluginProp
    • Remove a log from glossary.py
    • Disable validation of plugins unless in debug mode
    • plugin_prop.py: fix checking debug level
  • sq_entry_list.py: rename sortColumns to sqliteSortKey

  • Some refactoring around setSortKey between Glossary, EntryList and SqEntryList

  • Remove Entry.sqliteSortKeyFrom and related classmethods

  • Some more simplification in glossary.py

  • Remove Entry.defaultSortKey

  • Some style fixes

  • iter_utils.py: remove unused key= argument from unique_everseen

  • Refactor ui_gtk and update config comments

  • extractInlineHtmlImages: avoid writing file within sub func
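
The two deprecation fixes in this list (logging.warn and invalid escape sequences) correspond to the following general Python patterns. This is a generic standard-library illustration with a made-up logger name and sample strings, not the code changed in those pull requests.

```python
import logging
import re

log = logging.getLogger("example")

# logging.warn is a deprecated alias; call warning() instead
log.warning("file %r has no title", "test.txt")

# A raw string keeps "\d" intact for the regex engine. In a normal string
# literal "\d" is an invalid escape sequence, which Python reports as a
# DeprecationWarning (a SyntaxWarning since Python 3.12).
number_pattern = re.compile(r"\d+")
numbers = number_pattern.findall("3 cats, 12 dogs")
```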

PyGlossary 4.4.1

25 Jan 10:22
663748c

Changes since 4.4.0

Bug fixes

  • Automatically create cacheDir on Glossary.__init__()
    • Fixes exception in SQLite mode

Features

  • ui_cmd_interactive: support setting sortKey

Improvements and documentation

  • Wiktionary Dump: remove detect-by-extension
  • glossary.py: update docstrings for sortKeyName
  • sort_keys.py: add desc to NamedSortKey
  • Update doc/sort-key.md

PyGlossary 4.4.0

24 Jan 17:39
cfd61e8

Changes since 4.3.0

Breaking changes

  • Remove partial sorting support (obsolete feature)

    • Remove --sort-cache-size flag in command line
    • (For library users) Remove sortCacheSize argument to glos.write and glos.convert
  • Re-design sorting and sortKey parameters

    • Breaking change for library users, and user plugins that need sorting (sortOnWrite = ALWAYS)

    • Change glos.convert

      • Replace argument sortKey (Callable) with sortKeyName (str)
      • Add argument sortEncoding (str) defaulting to utf-8
    • Change glos.write

      • Replace argument sortKey (Callable) with namedSortKey (sort_keys.NamedSortKey)
      • Add argument sortEncoding (str) defaulting to utf-8
    • Change glos.sortWords

      • Replace argument key (Callable) with sortKeyName (str)
      • Add argument sortEncoding (str) defaulting to utf-8
    • Change API of plugins that use sortOnWrite = ALWAYS

      • Replace writer.sortKey and Writer.sqliteSortKey with sortKeyName in plugin module.
      • See the stardict.py for example.

    Note 1: All sortKey and sortEncoding arguments are optional.

    Note 2: Values of sortKeyName are documented in doc/sort-key.md

  • Rename 2 files in doc/:

    • Rename doc/entry_filters.md to doc/entry-filters.md
    • Rename doc/term_colors.md to doc/term-colors.md
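
The sorting re-design above replaces a sortKey callable with a sortKeyName string plus a sortEncoding. The idea can be sketched as a lookup table of named key factories. Everything here (`SORT_KEYS`, `resolve_sort_key`, and the two key names) is hypothetical; the real names are documented in doc/sort-key.md.

```python
from typing import Callable

SortKey = Callable[[list[str]], object]

# Hypothetical registry: name -> factory that builds a key function for an encoding
SORT_KEYS: dict[str, Callable[[str], SortKey]] = {
    "headword": lambda enc: lambda words: words[0].encode(enc, errors="replace"),
    "headword_lower": lambda enc: lambda words: words[0]
    .lower()
    .encode(enc, errors="replace"),
}


def resolve_sort_key(sort_key_name: str, sort_encoding: str = "utf-8") -> SortKey:
    """Look up a sort key by name instead of accepting an arbitrary callable."""
    try:
        factory = SORT_KEYS[sort_key_name]
    except KeyError:
        raise ValueError(f"invalid sort key name: {sort_key_name!r}") from None
    return factory(sort_encoding)


entries = [["banana"], ["Apple"], ["apple pie"], ["Cherry"]]
by_bytes = sorted(entries, key=resolve_sort_key("headword"))
by_lower = sorted(entries, key=resolve_sort_key("headword_lower"))
```

One plausible benefit of a name over a callable is that a string can be passed on the command line or stored in a config file, which a function object cannot.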

Features

  • --sort-key and --sort-encoding command line flags (as part of above re-design)

  • Now SQLite mode works for all output formats.
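
To give a feel for why an SQLite-backed mode can sort entries without holding them all in Python lists, here is a minimal sketch using the standard sqlite3 module. The function name, table layout and lowercase sort key are assumptions for illustration and do not reflect PyGlossary's internal schema.

```python
import sqlite3
from typing import Iterable, Iterator


def iter_sorted_via_sqlite(
    entries: Iterable[tuple[str, str]],
    db_path: str = ":memory:",
) -> Iterator[tuple[str, str]]:
    """Yield (word, defi) pairs sorted by word, letting SQLite do the sorting."""
    con = sqlite3.connect(db_path)
    try:
        con.execute("CREATE TABLE data (sortkey BLOB, word TEXT, defi TEXT)")
        # entries are consumed lazily, one row at a time
        con.executemany(
            "INSERT INTO data VALUES (?, ?, ?)",
            ((word.lower().encode("utf-8"), word, defi) for word, defi in entries),
        )
        con.commit()
        # rows come back in sort-key order without building a Python list
        yield from con.execute("SELECT word, defi FROM data ORDER BY sortkey")
    finally:
        con.close()


result = list(iter_sorted_via_sqlite([("b", "2"), ("A", "1"), ("c", "3")]))
```

The default `:memory:` path keeps the example self-contained; passing a fresh file path in a temporary directory is what would actually move the data out of RAM.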

Bug fixes

  • Fix lack of Progress Bar while writing in indirect or SQLite mode
  • Fix misleading message log about SQLite mode
  • Fix unclosed files in XDXF and FreeDict plugins

Improvements

  • Show a 1-line log instead of FileNotFoundError traceback in glos.read and glos.write
  • Close readers in glos.convert if write failed
  • Fix some type annotations and comments
  • (For library users) Change Glossary.__str__
  • (For library users) glos.setInfo: convert non-str value to str, and add tests

Unit testing

Add new tests and improve existing tests.

  • Coverage of glossary.py: 89%
  • Overall coverage of codebase + plugins: 58%

Refactoring and design improvements

  • Simplify by passing glos object to EntryList()
  • Replace SqList with SqEntryList
  • Change __iter__ of SqEntryList and EntryList to give entry objects
  • Simplify Glossary by moving gc.collect to EntryList and SqEntryList
  • Remove unused function xml_unescape
  • Remove unused import from FreeDict and JMDict plugins
  • Use operator.itemgetter in stardict.py, dict_cc.py, ebook_kobo.py, reverse.py
  • glossary.py: cleanup, simplify and optimize generators logic
    • Also remove index argument from entryFilter.run method and add some comments
  • Remove redundant check in glos.progress
  • Remove redundant check in _getLangByStr
  • Remove redundant check in Glossary.detectOutputFormat

PyGlossary 4.3.0

15 Jan 12:18
cf4db2b

Changes since 4.2.1

Bug fixes

  • Tabfile writer: fix replacing \ with \\
  • --remove-html flag: fix bad regex
  • ui_cmd_interactive: fix a few bugs
  • Lowercase word/entry links (<a href="bword://...) when --lower flag is passed
  • TextGlossaryWriter: do not skip words that start with #
  • Fix StdLogHandler: was not applying --no-color
  • Fix checking for sys.frozen

New features

  • Add auto_sqlite config parameter

    • to use SQLite mode for StarDict and EPUB-2 (which require sorting) by default
    • also allow overriding it with --no-sqlite flag
  • Add 3 config parameters that allow changing log colors in terminal:

    • color.cmd.critical
    • color.cmd.error
    • color.cmd.warning
  • Add 2 keys to config to enable/disable colors in Unix and Windows separately

    • color.enable.cmd.unix: default true
    • color.enable.cmd.windows: default false

New features for library users

  • Allow glos.setInfo(key, None) to delete the info / metadata key

  • Add glos.alts property as shortcut, and use it internally

Design improvements

Change rawEntry[0] from bytes to List[str] and avoid split/join when converting rawEntry <-> entry.
This fixes some rare edge cases involving | in words, but uses more RAM in indirect mode (converting to StarDict), which can be solved with --sqlite.

Documentation

Unit testing

Coverage of glossary.py: 75%

There are 2501 lines of test code in tests directory.

Tests for Glossary class include:

  • Basic functionality
  • Error handling
  • Sorting and direct / indirect / SQLite modes
  • Entry filter config/flags (lower, rtl, remove_html, remove_html_all)
  • Resources / data entries
  • Convert: Tabfile <-> Aard2 slob
  • Convert: Tabfile <-> CSV
  • Convert: Tabfile -> EPUB-2
  • Convert: Tabfile -> JSON
  • Convert: Tabfile <-> StarDict

Other improvements:

  • glossary_test.py: check CRC32 of downloaded test files
  • glossary_test.py: use a new temp dir for each test method for isolation.
  • ebook_kobo_test.py: split into several test methods

Improvements

  • Zim: make improvements, #352
  • Aard2 slob: add 2 mime types, #352
  • ui/main.py: do not allow --remove-html and --remove-html-all together
  • Glossary: do not allow glos.config to be set twice
  • Glossary: change some error logs to critical, and more improvements
  • Prevent conflicting config flags together, like --lower --no-lower
  • Disable utf8_check config parameter by default (not needed since 3.0.0)

Refactoring and cleanup

  • Glossary: some refactoring in convert method
  • Rename 3 scripts in scripts/ directory
  • Remove DataEntry.fromFile and improve behavior of DataEntry.__init__
  • Refactoring in ui/
  • rename option.cmdFlag to option.customFlag
  • Glossary: add glos.rawEntryCompress property, and use in entry.py
  • Glossary: minor improvement in loadPlugins
  • XDXF: remove useless argument in Reader.open
  • remove some unused functions from text_utils.py
  • plugin_prop.py: refactor getExtraOptions
  • Avoid assigning protected attrs in text_writer.py and plugins/tabfile.py
  • Fewer protected attr access in entry_filters.py
  • Move sortKey and get_prefix implementations from ebook_base.py to epub and mobi plugins
  • Change name of 2 entry filters to match the config param

PyGlossary 4.2.1

26 Dec 20:01
c0d0eef

Changes since version 4.2.0

Minor bug fixes and improvements:

  • text_utils.py

    • Minor bug: fix legacy function urlToPath using urllib.parse.unquote
    • Minor bug: replacePostSpaceChar: remove trailing space from the output str
    • Cleanup:
      • Remove unused function isControlChar
      • Remove unused function formatByteStr
      • Remove argument exclude from function isASCII
    • Add unit tests
  • ui_cmd_interactive.py: fix a minor bug and some small refactoring

  • Command line: Override input glossary info with --source-lang and --target-lang flags

  • Add unit tests for CSV -> Tabfile conversion

  • CSV plugin: some refactoring, and rename the module to csv_plugin.py

  • Update setup.py: add python_requires=">=3.7.0", update extras_require

  • Update README.md

Features:

  • Command line: Add --name flag for changing glossary name
  • Glossary: convert: add infoOverride optional argument

PyGlossary 4.2.0

20 Dec 08:30
1b1450c

Changes since 4.1.0

  • Breaking changes:

    • Replace glos.getAuthor() with glos.author
      • This looks for "author" and then "publisher" keys in info/metadata
    • Rename option apply_css to css for mobi and epub2
    • glos.getInfo and glos.setInfo only accept str as key (or a subclass of str)
  • Bug fixes:

    • Indirect mode: Fix handling '|' character in words.

      • Escape/unescape | in words when converting entry <-> rawEntry
    • Escape/unescape | in words when writing/reading text-based file formats

    • JSON: Prevent duplicate keys in json output, #344

      • Add new method glos.preventDuplicateWords()
  • Features and improvements

    • Add SQLite mode with --sqlite flag for converting to StarDict.

      • Eliminates the need to load all entries into RAM, limiting RAM usage.
      • You can add --sqlite to your command, even for running GUI.
        • For example: python3 main.py --tk --sqlite
      • See README.md for more details.
    • Add --source-lang and --target-lang flags

    • XDXF: support more tags and improvements

    • Add unit tests for Glossary class, and some functions in text_utils.py

    • Windows: change cache directory to %LOCALAPPDATA%

    • Some refactoring and optimization

    • Update, improve and re-format documentation
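
The bug fix above about escaping and unescaping | in words can be illustrated with a generic round-trip scheme, in which alternates are joined with "|" and a literal "|" inside a word is backslash-escaped. This is a sketch under that assumption; `join_words` and `split_words` are hypothetical helpers, and the exact escaping PyGlossary uses may differ.

```python
def join_words(words: list[str]) -> str:
    """Join alternates with "|", escaping backslashes and pipes inside each word."""
    # escape backslashes first, so the backslash added for "|" is not doubled
    return "|".join(
        word.replace("\\", "\\\\").replace("|", "\\|") for word in words
    )


def split_words(joined: str) -> list[str]:
    """Inverse of join_words: split on unescaped "|" and undo the escaping."""
    words: list[str] = []
    current: list[str] = []
    chars = iter(joined)
    for ch in chars:
        if ch == "\\":
            # keep the next character literally (a trailing "\" stays as-is)
            current.append(next(chars, "\\"))
        elif ch == "|":
            words.append("".join(current))
            current = []
        else:
            current.append(ch)
    words.append("".join(current))
    return words


original = ["a|b", "c\\d", "plain"]
joined = join_words(original)
restored = split_words(joined)
```

Without the escaping step, the first word would be split into two alternates when read back, which is the kind of edge case the fix addresses.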