Commit def96a6: for release v2.1

prohippo committed Oct 28, 2022
1 parent 4de3e6d commit def96a6
Showing 5 changed files with 47 additions and 29 deletions.
Binary file modified AWug.pdf
Binary file modified HowtoIndexText.pdf
71 changes: 44 additions & 27 deletions README.md
@@ -18,7 +18,7 @@
AW lexical features for indexing text currently fall into three types:
(1) all alphanumeric 2-grams, like TH, F1, 2X, or 00; (2) selected
alphabetic n-grams (for n > 2), like QUE, REVE, and CLASS; and (3) a
fixed number of user-defined alphanumeric word beginnings and endings,
-like THERMO- and -MOTHER.
+like -000000000, THERMO-, and -MOTHER.

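The three feature types are easy to picture in code. Here is a minimal
sketch, not taken from the AW sources: the class name FragmentSketch, the
tiny built-in n-gram set, and the affix lists are all stand-ins invented
for illustration.

    import java.util.*;

    public class FragmentSketch {
        // stand-ins for a few of AW's selected alphabetic n-grams (n > 2)
        static final Set<String> NGRAMS = Set.of("QUE", "REVE", "CLASS");
        // stand-ins for user-defined word beginnings and endings
        static final String[] PREFIXES = { "THERMO" };
        static final String[] SUFFIXES = { "MOTHER" };

        // collect index features for one alphanumeric word
        static List<String> features (String word) {
            String w = word.toUpperCase();
            List<String> fs = new ArrayList<>();
            for (int i = 0; i + 2 <= w.length(); i++)    // (1) all 2-grams
                fs.add(w.substring(i, i + 2));
            for (int n = 3; n <= 5; n++)                 // (2) selected n-grams
                for (int i = 0; i + n <= w.length(); i++)
                    if (NGRAMS.contains(w.substring(i, i + n)))
                        fs.add(w.substring(i, i + n));
            for (String p : PREFIXES)                    // (3) word beginnings
                if (w.startsWith(p)) fs.add(p + "-");
            for (String s : SUFFIXES)                    //     and endings
                if (w.endsWith(s)) fs.add("-" + s);
            return fs;
        }

        public static void main (String[] a) {
            System.out.println(features("THERMOCLINE"));
            // [TH, HE, ER, RM, MO, OC, CL, LI, IN, NE, THERMO-]
        }
    }

The real inventory is much larger (the v2.1 notes below mention 2,500
built-in 4-grams alone), but the control flow is the same idea.
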
Indexing with word fragments will tend to be noisier than with whole
words. For example, if a text item has a long word like CONFABULATE that
@@ -29,40 +29,43 @@
crossword puzzle fan knows that word fragments do carry information.

So, how big would our finite index set have to be to support useful text
analysis? The ActiveWatch demonstration makes the case that 10⁴ should
-be enough in English for automatic clustering of text items by content
-or for detecting highly unusual content in a dynamic text stream. You
-should look elsewhere, however, if you just want to find all documents
+be enough in English for automatic clustering of short text items by
+content or for detecting highly unusual content in a dynamic text stream.
+You should look elsewhere, however, if you just want to find all documents
containing a specific word.

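To make the finite representation concrete: with an index set of 10⁴
features, every text item becomes a vector of 10,000 feature counts. The
sketch below is again invented for illustration; it hashes feature strings
to positions, whereas in AW itself the fixed feature inventory defines the
positions directly.

    public class VectorSketch {
        static final int DIM = 10000;   // size of the finite index set

        // accumulate a count vector for one text item's features
        static int[] vectorize (Iterable<String> features) {
            int[] v = new int[DIM];
            for (String f : features)
                v[Math.floorMod(f.hashCode(), DIM)]++;
            return v;
        }
    }

Applied to all the words of an item, something like
vectorize(FragmentSketch.features(word)) accumulates the item vector.
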
-The big advantage of a finite vector representation of text is
-that it lets us organize system processing at a higher level of
-abstraction. Once we encode text as vectors, it should not matter
-where these vectors came from. We care only that they are easy to work
-with and carry enough information for the purposes of our users.
+The advantage of a finite vector representation of text is that it lets us
+organize information processing at a level of abstraction that simplifies
+the computations a system must carry out. Once we encode text as vectors,
+it should not matter where these vectors came from. We care only that they
+are convenient to work with and carry enough information for the needs and
+purposes of information users.

-Vector data of finite dimensionality makes a statistically scaled
-similarity measure possible. This is easy to interpret and allows a text
-processing system to make decisions on its own without a human always
-hovering around like a helicopter parent. Real-time systems with dynamic
-data become more manageable and more resilient in unexpected situations.
+Vector data of finite dimensionality makes a statistically scaled measure
+of similarity possible. Such scaling makes a measure easier to interpret
+and allows a text processing system to make decisions reliably on its own.
+Human users must otherwise hover around like a helicopter parent for quality
+control. This is especially critical in real-time systems with dynamic data;
+they become more manageable and more resilient in unexpected situations.

AW will score similarity by the number of standard deviations that a raw
-inner product similarity score falls within a theoretical noise distribution.
-This noise will be roughly Gaussian; an AW scaled similarity of 3 standard
-deviations should be significant at about p = .003. With actual text data,
-we can expect AW scaled similarity above 6 standard deviations.
+vector inner product similarity score falls above the mean of a theoretical
+noise distribution. This noise will be roughly Gaussian, so that an AW scaled
+similarity of 3 standard deviations should be significant at about p = .003.
+With actual text data, AW should typically work with scaled similarity well
+above 6 standard deviations.

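The computation behind such scoring is short. In the sketch below, again
illustrative rather than the AW source, the noise mean mu and standard
deviation sigma are taken as given; producing them from a theoretical
noise model is the part AW itself supplies.

    public class ScaledSimilarity {
        // raw inner product of two finite count vectors
        static long innerProduct (int[] u, int[] v) {
            long s = 0;
            for (int i = 0; i < u.length; i++)
                s += (long) u[i] * v[i];
            return s;
        }

        // similarity as standard deviations above the noise mean
        static double scaled (int[] u, int[] v, double mu, double sigma) {
            return (innerProduct(u, v) - mu) / sigma;
        }
    }

A scaled value of 3 then matches the quoted significance: for a Gaussian,
3 standard deviations leaves about .0027 of probability in the two tails,
which rounds to p = .003.
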
Some index tuning is needed to achieve such performance. This will mainly
involve adjustments of the indexing features defined by a user for particular
target text data. Automatic stemming and stopword deletion also allow AW
users to exclude purely grammatical instances of n-grams like ING, MENT,
-or ATION when indexing text content.
+or ATION when indexing text for content.

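Why stemming removes those n-grams: if a grammatical ending is stripped
before fragments are extracted, its n-grams never reach the index. A toy
version of the idea (AW's real stemming is rule-based and more careful):

    public class StemSketch {
        // purely grammatical endings that should not index content
        static final String[] ENDINGS = { "ATION", "MENT", "ING" };

        // drop one such ending before fragment extraction, so that
        // n-grams like ING or MENT never enter the index
        static String stem (String word) {
            String w = word.toUpperCase();
            for (String e : ENDINGS)
                if (w.endsWith(e) && w.length() > e.length() + 2)
                    return w.substring(0, w.length() - e.length());
            return w;
        }
    }
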
AW was first written in C around 1982 for information discovery in unfamiliar
text data. The current Java version dates back to around 1999, but has some
-recent tweaks in its linguistic analysis and addition of 4- and 5-letter
-word fragments for indexing. Only 2- and 3-letter fragments, plus user-defined
-indices, were built into AW previously.
+recent tweaks in its linguistic analysis and addition of 4- and 5-letter word
+fragments for more precise indexing. Only 2- and 3-letter fragments, plus
+user-defined indices, were built into AW previously.

The modules included in the AW GitHub repository mainly provide support for
simple clustering of text items by content. The code is organized functionally
@@ -72,9 +72,9 @@
a somewhat new programming language.

Java AW eventually evolved to support many kinds of statistical natural
language processing, but this GitHub repository includes only a small subset of
-modules for automatic clustering of text items in particular. This software
-should give you a good overall idea of what you can do with AW finite indexing
-and statistically scaled similarity between pairs of item vectors.
+modules to demonstrate automatic clustering of text items in particular. This
+software should give you a good overall idea of what you can do with AW finite
+indexing and statistically scaled similarity between pairs of finite item vectors.

The latest AW release includes fifteen prebuilt AW modules. These might support
military intelligence operations or the tagging of news streams for resale
@@ -85,7 +88,7 @@
subdirectory of this repository.
All Java source code is included in the GitHub repository. You can build out
all the AW modules by running the 'build' shell script included with the AW
GitHub download. The script is for macOS Darwin Unix and should be edited for
-your iown computing platform. You will have to install a Java JDK if you do
+your own computing platform. You will have to install a Java JDK if you do
not have one already. Everything in the AW demonstration still has to run
from a command line.

@@ -133,7 +136,7 @@ Release History:
update Dprb for v0.7 changes in n-gram initialization
update documentation

-v0,7.2 30aug2021 fix integration of AW-defined and user-defined indices
+v0.7.2 30aug2021 fix integration of AW-defined and user-defined indices
update documentation

v0.7.3 04sep2021 add suffix rule to fix stemming glitch
@@ -306,3 +309,17 @@ Release History:
fix bug in word hash table for KEYWDR and PHRASR
fix bugs for phrase scoring
update documentation

+v2.1 28oct2022 expand builtin 4-grams to 2,500
+               remove POLY- and MONO- from default literals
+               fix bug in building stopword table
+               fix bug in reading in syntactic type definitions
+               fix bugs in loading rewriting rules
+               fix ByteTool bug not keeping upper and lower case
+               fix bug in syntax symbol lookup
+               debug, clean up, and simplify syntax symbol table
+               clean up DPRO output for content profiles
+               fix problems in feature coding for phrase analysis
+               clean up and test joining and splitting in Reparser
+               make rules file for Reparser self-documenting
+               update documentation
4 changes: 2 additions & 2 deletions aw/Banner.java
@@ -22,15 +22,15 @@
// OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
// -----------------------------------------------------------------------------
-// AW File Banner.java : 13sep2022 CPM
+// AW File Banner.java : 30sep2022 CPM
// main module banner class

package aw;

import java.io.*;

public final class Banner {
static final String version = "v2.0 (2022) Java";
static final String version = "v2.1 (2022) Java";
static final String year = "1997-2002, 2022";

String id;
1 change: 1 addition & 0 deletions build
@@ -521,6 +521,7 @@
aw/phrase/EndingType.class \
aw/phrase/InflectionType.class \
aw/phrase/LexicalAtom.class \
aw/phrase/LexicalAtomStream.class \
+aw/phrase/LexicalAtomStream\$AtomRotation.class \
aw/phrase/LexicalStream.class \
aw/phrase/LinedText.class \
aw/phrase/LiteralPattern.class \
