Releases: prohippo/ActiveWatch
Add 4-Gram Indices For Chemical Nomenclature
Minor Substitution of 4-Grams
We need IOUS as an indexing feature. It replaces OXYL to keep the AW 4-gram count at 2,960. The change from v2.9.2 to v2.9.2.1 should have only a minor effect on AW.
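A minimal sketch of what a one-for-one substitution does to a fixed-size index set. The three-entry set and the helper names `substitute` and `matched_4grams` are hypothetical illustrations, not AW code, and AW's actual matching rules may differ.

```python
def substitute(index_set, old, new):
    """Return a copy of index_set with old swapped for new (size unchanged)."""
    if old not in index_set:
        raise KeyError(f"{old} is not in the index set")
    if new in index_set:
        raise ValueError(f"{new} is already in the index set")
    return (index_set - {old}) | {new}


def matched_4grams(word, index_set):
    """List every 4-gram of word, left to right, that is in index_set."""
    w = word.upper()
    return [w[i:i + 4] for i in range(len(w) - 3) if w[i:i + 4] in index_set]


# Toy stand-in for the built-in 4-gram set (assumption: the real set has 2,960).
before = frozenset({"OXYL", "TION", "MENT"})
after = substitute(before, "OXYL", "IOUS")

print(len(before) == len(after))            # the count stays the same
print(matched_4grams("various", after))     # IOUS is now an index
print(matched_4grams("carboxyl", after))    # OXYL no longer is
```

The swap trades coverage of one letter pattern for another, so only texts that contain OXYL or IOUS are indexed differently.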
Larger N-Gram Index Set
Another round of adding 4- and 5-grams for sharper finite indexing. This produces fewer clusters than before, but they should be less noisy. We are reaching the point of diminishing returns.
v2.9.1
More N-Grams, Extend Stemming Rules, Clean up Profile Generation
Miscellaneous improvements: add 40 4-grams and 10 5-grams, fix errors and omissions in morphological stemming rules, improve output of DCMS tool, simplify profile generation for keyword scanning and clean up source code.
More 4- and 5-Grams, Generalize Sequential Scan of Vectors
This is a simple extension of v2.8.3, with 60 new 4-grams and 10 new 5-grams. It also upgrades the AW sequential vector scan to be more like a search engine.
Extend Indexing, Update Core Modules, New and Modified Tools
Miscellaneous upgrade of AW capabilities: 20 new 4-gram indices and 50 new 5-gram indices; a larger link buffer for CLUSTR; an extended WATCHR that can optionally show vectors with high-probability indices as well as low-probability ones; upgrades to DSSG and DSRV; and new DKYS, DKTG, and DTOP tools. The changes were in support of analysis of presidential tweets (2017-2021).
Major Bug Fix, Added Indexing 4-Grams, Better Labeling of Output
Fixes a major bug with mapping of 2- and 3-grams into overlapping ranges. The overlap added indexing noise, which increased the number of clusters obtained with a text data set. The new code produces about the same number of clusters with the Google News text, but has fewer unclustered items. Also added 40 more indexing 4-grams and improved the labeling of output from diagnostic tools.
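To show why overlapping ranges add noise, here is a minimal sketch that gives 2-grams and 3-grams disjoint index ranges. The base-26 numbering, the offsets, and the function names are assumptions made for illustration; the release notes do not describe AW's actual index layout.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)          # 26 letters
N2 = BASE ** 2                # 676 possible alphabetic 2-grams

# Hypothetical layout: 2-grams occupy [0, 676), 3-grams start right after.
OFFSET_2GRAM = 0
OFFSET_3GRAM = OFFSET_2GRAM + N2


def ngram_code(ngram):
    """Read an alphabetic n-gram as a base-26 number (A=0 ... Z=25)."""
    code = 0
    for ch in ngram.upper():
        code = code * BASE + ALPHABET.index(ch)
    return code


def index_of(ngram):
    """Map a 2- or 3-gram to an index in its own, non-overlapping range."""
    if len(ngram) == 2:
        return OFFSET_2GRAM + ngram_code(ngram)
    if len(ngram) == 3:
        return OFFSET_3GRAM + ngram_code(ngram)
    raise ValueError("only 2- and 3-grams are handled in this sketch")


# Without separate offsets, unrelated n-grams share a code:
print(ngram_code("BA"), ngram_code("ABA"))   # same raw code for both
# With separate offsets, the indices are distinct:
print(index_of("BA"), index_of("ABA"))
```

When two unrelated n-grams land on the same index, their counts are merged, so text vectors appear to share features they do not really have. That is one plausible way an overlap could distort clustering.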
v2.8.1
This release adds 60 more alphabetic 4-grams to the AW built-in index set. This finally nudges up indexing entropy a bit. A new tool DCMS now lets a user find out what profiles were matched by a given text segment. Documentation now refers to "user-defined n-grams" instead of "literal n-grams."
Check Effect of Adding 100 Alphabetic 4-Gram Indices
We should be close to a point of diminishing returns for indexing of English text with more alphabetic 4-grams. Release v2.8 now has about 11 thousand 2-, 3-, 4-, and 5-gram built-in indices, plus up to another 2 thousand user-provided prefix and suffix indices (or literal n-grams). The entropy of indexing with the Google News sample seems to be stuck at about 11.60 bits, but defining another 100 4-grams seems doable to see how big a finite set of indices might grow to.
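For readers unfamiliar with the entropy figure, the sketch below computes Shannon entropy in bits from index occurrence counts. It assumes "entropy of indexing" means the entropy of the relative frequencies of indices over a text sample; the release notes do not give AW's exact definition, and the counts are made up.

```python
import math
from collections import Counter


def indexing_entropy(index_counts):
    """Shannon entropy, in bits, of an index frequency distribution."""
    total = sum(index_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in index_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Made-up counts: four indices used equally often versus very unevenly.
uniform = Counter({"THER": 5, "TION": 5, "MENT": 5, "IOUS": 5})
skewed = Counter({"THER": 17, "TION": 1, "MENT": 1, "IOUS": 1})

print(indexing_entropy(uniform))   # 2.0 bits, the maximum for 4 indices
print(indexing_entropy(skewed))    # lower, because one index dominates
print(math.log2(11000))            # upper bound for 11,000 equally used indices
```

Under this definition, entropy is capped at log2 of the number of indices (about 13.4 bits for 11,000) and is reached only if all indices occur equally often. New indices that occur rarely in the sample move the figure very little, which would be consistent with the plateau described above.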