Releases: prohippo/ActiveWatch
More Built-in 4-Gram Indices and Literal Indices
This release adds 60 alphabetic 4-grams to the AW built-in index set. In the AW demonstration, this finally pushed the total probability mass of 4-grams above that of 3-grams. It also restored the number of output clusters in that demonstration to its previous level of around 80.
Fix Noisy Clusters in Google News Demonstration
The growth in AW built-in 4- and 5-grams has increased the likelihood of visible noisy matches with common 2- and 3-grams that so far have been missed in noise reduction efforts.
Continuing to Extend Alphabetic 4- and 5-Gram Indexing
Putting more n-grams into an AW index set will generally reduce noise. The latest additions include mostly common English 5-letter words listed by an ESL web site, which became alphabetic 5-grams. This has a big effect on clustering.
General Cleanup
This fixes various small problems in inflectional and morphological stemming that show up in high-frequency alphabetic 3-grams. More 4- and 5-grams were also added to AW indexing. Also learned how to use Markdown to make the README release history more readable.
Upload Missing Source Files, New Modules, Extended Indexing
The repository should now finally have all the Java source files to compile the entire AW demonstration system. Sorry about the slip-up here. The latest release also includes two new AW processing modules for user-defined classification profiles to go alongside cluster-defined profiles. The number of alphabetic 4-gram indices is now up to 2,320. Documentation was cleaned up.
Extend 4- and 5-Gram Indexing
Still adding alphabetic 4- and 5-grams for indexing. Now up to 2,290 4-grams and 630 5-grams.
Add N-Gram Indexing Options, More 4- and 5-Grams
This allows for experimental indexing without 5-grams or without 4- and 5-grams. The total of alphabetic 4- and 5-grams has increased again to reach indexing entropy of 11.6 bits.
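As a rough illustration of what an entropy figure in bits can mean here, the sketch below computes the Shannon entropy of a distribution of index occurrence counts. This is an assumption about how "indexing entropy" might be defined, not AW's actual computation; the class name `IndexEntropySketch`, the method `entropyBits`, and the sample counts are all hypothetical.

```java
import java.util.Map;

// Hypothetical sketch: NOT taken from the AW source code.
class IndexEntropySketch {

    // Shannon entropy, in bits, of the distribution implied by raw occurrence counts.
    // Assumes each map key is one index (for example an n-gram) and each value its count.
    static double entropyBits(Map<String, Long> counts) {
        long total = 0;
        for (long c : counts.values()) {
            total += c;
        }
        if (total == 0) {
            return 0.0; // no occurrences, so no uncertainty to measure
        }
        double h = 0.0;
        for (long c : counts.values()) {
            if (c == 0) {
                continue; // a zero-probability index contributes nothing
            }
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2)); // convert natural log to log base 2
        }
        return h;
    }

    public static void main(String[] args) {
        // Made-up counts for illustration: probabilities 1/2, 1/4, 1/8, 1/8.
        Map<String, Long> counts = Map.of("the", 4L, "ing", 2L, "tion", 1L, "that", 1L);
        System.out.println(entropyBits(counts) + " bits");
    }
}
```

Under this assumed definition, entropy rises as more indices share the probability mass more evenly. For reference, 11.6 bits is the entropy of roughly 3,100 equally likely indices (2^11.6).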
Add Diagnostic Tools, Extend 4- and 5-Gram Indexing
This release adds the DPRO and DLST diagnostic tools, allowing users to view cluster profiles and match lists in more detail. These will make the AW demo more transparent. The AW built-in n-gram index set has again been extended to over 2,700 alphabetic 4- and 5-grams. The AW User Manual has been extended and cleaned up.
Big Extension of 4-Gram Indexing
This was to see how much closer we could move the total number of 4-gram occurrences in the AW demonstration to the total number of 3-gram occurrences. There is still a way to go here, but indexing entropy did increase again.
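The comparison of 4-gram and 3-gram occurrence totals described above could be tallied along the following lines. This is only a sketch under stated assumptions: it takes a map from n-gram string to occurrence count and groups by string length. The class `NGramMassSketch`, its methods, and the sample data are hypothetical and do not come from the AW source.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: NOT taken from the AW source code.
class NGramMassSketch {

    // Sum occurrence counts per n-gram length, using the key's character length as n.
    static Map<Integer, Long> totalsByLength(Map<String, Long> counts) {
        Map<Integer, Long> totals = new TreeMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            totals.merge(e.getKey().length(), e.getValue(), Long::sum);
        }
        return totals;
    }

    // Share of all occurrences that belongs to n-grams of length n (0.0 if there are none).
    static double share(Map<Integer, Long> totals, int n) {
        long all = 0;
        for (long t : totals.values()) {
            all += t;
        }
        return all == 0 ? 0.0 : (double) totals.getOrDefault(n, 0L) / all;
    }

    public static void main(String[] args) {
        // Made-up counts for illustration only.
        Map<String, Long> counts =
            Map.of("the", 5L, "ing", 3L, "tion", 4L, "that", 2L, "there", 1L);
        Map<Integer, Long> totals = totalsByLength(counts);
        System.out.println("3-gram occurrences: " + totals.get(3));
        System.out.println("4-gram occurrences: " + totals.get(4));
        System.out.println("4-gram share: " + share(totals, 4));
    }
}
```

With a tally like this, adding more built-in 4-grams shows up directly as a larger 4-gram total relative to the 3-gram total.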
Minor Cleanup and Further Extension of Built-In 4- and 5-Grams
The number of AW built-in alphabetic 4-grams has reached 2,020. This probably will keep rising, but for philosophical and practical reasons, an AW upper limit for 4-grams has been set at 2,500. It is getting harder to find hundreds more alphabetic 4-grams that are common enough to make much difference in n-gram indexing.