Releases: prohippo/ActiveWatch

Add 4-Gram Indices For Chemical Nomenclature

20 Feb 22:06

This mostly prepares for a new AW demonstration with scientific text.

Minor Substitution of 4-Grams

07 Feb 01:12

We need IOUS as an indexing feature. It replaces OXYL to keep the AW 4-gram count at 2,960. The change from v2.9.2 to v2.9.2.1 should have only a minor effect on AW.

Larger N-Gram Index Set

30 Jan 20:05

Another round of adding 4- and 5-grams for sharper finite indexing. This produces fewer clusters than before, but they should be less noisy. We are reaching the point of diminishing returns.

v2.9.1

01 Jan 03:52

One more expansion of built-in AW indexing.

More N-Grams, Extend Stemming Rules, Clean up Profile Generation

08 Oct 23:30

Miscellaneous improvements: add 40 4-grams and 10 5-grams, fix errors and omissions in morphological stemming rules, improve output of DCMS tool, simplify profile generation for keyword scanning and clean up source code.

More 4- and 5-Grams, Generalize Sequential Scan of Vectors

21 Sep 02:22

This is a simple extension of v2.8.3, with 60 new 4-grams and 10 new 5-grams. It also upgrades the AW sequential scan of vectors to be more like a search engine.

Extend Indexing, Update Core Modules, New and Modified Tools

15 Sep 23:25

Miscellaneous upgrade of AW capabilities: 20 new 4-gram indices and 50 new 5-gram indices; a larger link buffer for CLUSTR; an extension of WATCHR to optionally show vectors with high-probability indices as well as low-probability ones; upgrades to DSSG and DSRV; and new DKYS, DKTG, and DTOP tools. The changes were in support of analysis of presidential tweets (2017-2021).

Major Bug Fix, Added Indexing 4-Grams, Better Labeling of Output

15 Sep 17:22
8727933

Fixes a major bug with mapping of 2- and 3-grams into overlapping ranges. This would add indexing noise, which would increase the number of clusters obtained with a text data set. The new code produces about the same number of clusters with the Google News text, but leaves fewer unclustered items. Also adds 40 more indexing 4-grams and improves the labeling of output from diagnostic tools.
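
To illustrate why overlapping ranges add noise, here is a minimal sketch of one way to number 2- and 3-grams so that the two ranges cannot collide. It is not the AW code: the lowercase 26-letter alphabet, the offsets, and the names `ngram_code` and `index_number` are all assumptions made for the example.

```python
# Hypothetical numbering scheme, not taken from the AW source.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
BASE = len(ALPHABET)

# Assumed layout: 2-grams fill [0, 676); 3-grams start right after them.
OFFSET_2GRAM = 0
OFFSET_3GRAM = OFFSET_2GRAM + BASE ** 2


def ngram_code(ngram):
    """Read a lowercase alphabetic n-gram as a base-26 number."""
    code = 0
    for ch in ngram:
        code = code * BASE + ALPHABET.index(ch)  # ValueError on non-letters
    return code


def index_number(ngram):
    """Map a 2- or 3-gram to an index number in its own disjoint range."""
    if len(ngram) == 2:
        return OFFSET_2GRAM + ngram_code(ngram)
    if len(ngram) == 3:
        return OFFSET_3GRAM + ngram_code(ngram)
    raise ValueError("only 2- and 3-grams are handled in this sketch")


if __name__ == "__main__":
    print(index_number("zz"), index_number("aaa"))
```

If the 3-gram offset were smaller than the size of the 2-gram range, some 2-grams and 3-grams would share an index number, and unrelated text segments would appear to have features in common.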

v2.8.1

29 Jun 03:43

This release adds 60 more alphabetic 4-grams to the AW built-in index set. This finally nudges up indexing entropy a bit. A new tool DCMS now lets a user find out what profiles were matched by a given text segment. Documentation now refers to "user-defined n-grams" instead of "literal n-grams."

Check Effect of Adding 100 Alphabetic 4-Gram Indices

23 Mar 12:44

We should be close to a point of diminishing returns for indexing of English text with more alphabetic 4-grams. Release v2.8 now has about 11 thousand 2-, 3-, 4-, and 5-gram built-in indices, plus up to another 2 thousand user-provided prefix and suffix indices (or literal n-grams). The entropy of indexing with the Google News sample seems to be stuck at about 11.60 bits, but defining another 100 4-grams seems doable to see how big a finite set of indices might grow.
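
As a rough illustration of the entropy figure, the sketch below assumes that "entropy of indexing" means the Shannon entropy of the distribution of index occurrences over a text sample. The function names are hypothetical, and `count_4grams` is only a simplified stand-in for AW indexing, which uses a fixed set of selected n-grams rather than every 4-gram.

```python
import math
from collections import Counter


def index_entropy_bits(index_counts):
    """Shannon entropy, in bits, of an index frequency distribution."""
    total = sum(index_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in index_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def count_4grams(text):
    """Simplified stand-in for indexing: count every alphabetic 4-gram per word."""
    counts = Counter()
    for word in text.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        for i in range(len(letters) - 3):
            counts[letters[i:i + 4]] += 1
    return counts


if __name__ == "__main__":
    sample = "various serious studies of chemical nomenclature"
    print(round(index_entropy_bits(count_4grams(sample)), 2))
```

For scale, a perfectly uniform distribution over 11,000 indices would have an entropy of log2(11000), roughly 13.4 bits, so a measured value below that reflects indices that occur unevenly in real text.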