Releases: prohippo/ActiveWatch
Add Diagnostic Tools and Clarify Command Line Usage
This brings the DKYW diagnostic tool from the original Java reimplementation into the new AW repository. DKYW was extracted from the KEYWDR main module to facilitate testing and experimentation. This release also adds some usage advisories for DKYW and DQBE.
Add Diagnostic Tools Showing Details in Scaling Similarity Scores
The DSIM and DSMX tools were in the original C implementation of AW. They have been used to test the computation of statistically scaled similarity, but can also help identify which n-grams contribute the most to a final match score. There are two tools because DSIM works with a noise model for the similarity expected between random pairs of text items, whereas DSMX works with the scores of random items against a single fixed profile.
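As a rough illustration of what such a diagnostic can report, the sketch below computes a raw similarity as an inner product over shared n-gram indices, lists each n-gram's contribution to that score, and scales the raw score against a noise mean and standard deviation. The class name, the method names, the inner-product measure, and the z-score form of scaling are all assumptions made for this example; the release note does not state how AW actually scales similarity.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not AW code: vectors map an n-gram index to a weight.
final class ScaledSimilaritySketch {

    // Raw similarity as an inner product over the n-gram indices both vectors share.
    static double raw(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) sum += e.getValue() * w;
        }
        return sum;
    }

    // Per-n-gram contributions to the raw score, largest first.
    static List<Map.Entry<Integer, Double>> contributions(Map<Integer, Double> a,
                                                          Map<Integer, Double> b) {
        List<Map.Entry<Integer, Double>> out = new ArrayList<>();
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                out.add(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue() * w));
            }
        }
        out.sort((x, y) -> Double.compare(y.getValue(), x.getValue()));
        return out;
    }

    // Assumed scaling: distance of the raw score from the noise mean,
    // measured in noise standard deviations.
    static double scaled(double rawScore, double noiseMean, double noiseStdDev) {
        return (rawScore - noiseMean) / noiseStdDev;
    }
}
```

In this sketch, the noise mean and standard deviation would come from random pairs of items for a DSIM-style check, or from random items scored against one fixed profile for a DSMX-style check.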
Add Basic Search With Profiles to Testing Tools
The QSRV, DQBE, and DQBK test tools were in the original Java reimplementation of ActiveWatch over two decades ago. They implement a basic search capability helpful for checking newer AW main modules that contain code written since 2021. They can also provide a direct way of exploring an AW data set, though falling short of being a full-fledged search engine.
Add Analytic Capabilities
This release adds three modules (PLOTTR, RANKER, HUBBER) to support analysis of Twitter text data, and it reorganizes and cleans up various source code files. The new modules can also be applied more generally to non-Twitter data. They sit on top of the current AW automatic clustering capability and involve only a little new code.
Fix Inflectional Stemming Bugs
This corrects some long-standing problems with the code for inflectional stemming. These were in the original C source code for ActiveWatch and were carried over in its incomplete migration to Java two decades ago. Although the change makes little appreciable difference, it will help make reports easier to interpret.
Enlarge Batch Size for Clustering of Text Items
This is to facilitate analysis of Twitter data, which will usually have many items of short length. The previous batch limit of 8,192 has been increased to 16,384, and data structures have been expanded to accommodate more resulting clusters. Earlier batch limits reflect the capabilities of computing hardware decades ago.
Upgrade AW Clustering to Handle Twitter Data
Twitter items have little actual text and many different special markers that can add noise to indexing. This release extends the default alphabetic 4-grams and literal n-grams to reduce noise in general, and it also makes the AW clustering algorithm easier to control.
Fix Bug in Buffering UTF-8 Text Input
This release corrects a long-standing bug in the buffering of UTF-8 text data input for analysis. The original C code for AW was ASCII only and had to be completely rewritten for its Java reimplementation. Processing errors in v2.3 became evident for text with emoticons, but problems could also arise with non-ASCII punctuation.
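The release note does not say where the buffering went wrong, but one common failure of this kind is a multi-byte UTF-8 character that straddles two buffer reads. The sketch below shows one way to guard against that: hold back any incomplete trailing sequence and prepend it to the next chunk. The class and method names are hypothetical, and this is not the AW fix itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch, not AW code: decode UTF-8 bytes chunk by chunk.
final class Utf8ChunkSketch {

    // Count trailing bytes of buf that begin a UTF-8 sequence but do not finish it.
    static int incompleteTail(byte[] buf) {
        int len = buf.length;
        for (int back = 1; back <= Math.min(4, len); back++) {
            int b = buf[len - back] & 0xFF;
            if ((b & 0xC0) == 0x80) continue;      // continuation byte, keep looking for the lead
            int need;
            if (b < 0x80) need = 1;                // ASCII
            else if ((b & 0xE0) == 0xC0) need = 2; // 2-byte sequence
            else if ((b & 0xF0) == 0xE0) need = 3; // 3-byte sequence
            else if ((b & 0xF8) == 0xF0) need = 4; // 4-byte sequence, e.g. emoji
            else return 0;                         // invalid lead byte, leave it to the decoder
            return (back < need) ? back : 0;
        }
        return 0;
    }

    // Decode data as if it arrived in reads of chunkSize bytes (chunkSize >= 1).
    static String decodeInChunks(byte[] data, int chunkSize) {
        StringBuilder out = new StringBuilder();
        byte[] carry = new byte[0];
        for (int pos = 0; pos < data.length; pos += chunkSize) {
            int n = Math.min(chunkSize, data.length - pos);
            byte[] buf = new byte[carry.length + n];
            System.arraycopy(carry, 0, buf, 0, carry.length);
            System.arraycopy(data, pos, buf, carry.length, n);
            int tail = incompleteTail(buf);
            out.append(new String(buf, 0, buf.length - tail, StandardCharsets.UTF_8));
            carry = Arrays.copyOfRange(buf, buf.length - tail, buf.length);
        }
        // Anything still held back is a truncated sequence at end of input.
        out.append(new String(carry, StandardCharsets.UTF_8));
        return out.toString();
    }
}
```

Decoding each chunk independently, without the carry-over, would turn the split character into replacement characters, which matches the kind of error that shows up only for emoji or non-ASCII punctuation.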
Add 4-Gram Indices, Fix Vector Squeezing Bug for Clustering
This adds 80 alphabetic 4-grams that probably should have been included previously for text indexing. Indexing entropy increased slightly as a result, indicating we are on the right track. A bug in the squeezing of index vectors before clustering was discovered because of the change in indexing and was fixed.
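To make the indexing idea concrete, here is a minimal sketch that counts occurrences of a fixed set of alphabetic 4-grams in a text and computes the Shannon entropy of the resulting counts. The class name, the sliding-window extraction, and the entropy definition are assumptions for illustration; the release note does not specify how AW selects 4-grams within a word or how it defines indexing entropy.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not AW code.
final class FourGramSketch {

    // Count every overlapping 4-letter window that appears in the indexed set.
    static Map<String, Integer> count(String text, Set<String> indexed) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            for (int i = 0; i + 4 <= word.length(); i++) {
                String gram = word.substring(i, i + 4);
                if (indexed.contains(gram)) counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Shannon entropy in bits of the distribution given by the counts.
    static double entropyBits(Map<String, Integer> counts) {
        double total = 0.0;
        for (int c : counts.values()) total += c;
        double h = 0.0;
        for (int c : counts.values()) {
            double p = c / total;
            h -= p * Math.log(p) / Math.log(2.0);
        }
        return h;
    }
}
```

Under this definition, entropy is higher when occurrences are spread more evenly over more indices.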
Fix Stopword Lookup Bug When Some Have Periods or Apostrophes
The bug was in the handling of periods or apostrophes in stopwords. In special cases, these could make ordinary stopwords unfindable in the AW stopword table. The bug first became noticeable in v2.0.
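One general way to avoid this class of bug is to pass both table entries and lookup keys through the same normalization, so that a period or apostrophe cannot make the two forms diverge. The sketch below illustrates that idea only. The class name and the specific normalization rules (lowercasing, unifying the curly apostrophe, dropping trailing periods) are assumptions, not a description of the AW stopword table or of the actual fix.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not AW code.
final class StopwordTableSketch {
    private final Set<String> table = new HashSet<>();

    // Shared normalization applied on both insert and lookup.
    static String normalize(String word) {
        String s = word.toLowerCase(Locale.ROOT).replace('\u2019', '\'');
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == '.') end--; // "etc." -> "etc"
        return s.substring(0, end);
    }

    void add(String word) {
        table.add(normalize(word));
    }

    boolean isStopword(String token) {
        return table.contains(normalize(token));
    }
}
```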