HyperspaceAnalogueToLanguage

fozziethebeat edited this page Oct 26, 2011 · 3 revisions

Hyperspace Analogue to Language

Introduction

Hyperspace Analogue to Language (HAL) creates a semantic space from word co-occurrences. It builds a word-by-word matrix in which each element is the strength of association between the word represented by the row and the word represented by the column. The user of the algorithm then has the option to drop low-entropy columns from the matrix.

As the text is analyzed, a focus word is placed at the beginning of a ten-word window that records which neighboring words are counted as co-occurring. Matrix values are accumulated by weighting each co-occurrence inversely to its distance from the focus word; closer neighbors are thought to reflect more of the focus word's semantics and so are weighted more heavily. HAL also records word-order information by treating a co-occurrence differently depending on whether the neighboring word appeared before or after the focus word.

Typically, all of the co-occurrence information is used to build the semantic vectors (for an N x N matrix, these are 2*N in length, since each word's vector combines its row and its column). However, HAL also offers two options for dimensionality reduction. Not all columns provide an equal amount of information for distinguishing the meanings of the words. Specifically, the information-theoretic entropy of each column can be calculated as a way of ordering the columns by their importance. Using this ranking, either a fixed number of columns may be retained, or a threshold may be set to filter out low-entropy columns.
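The two reduction strategies can be illustrated with a short Python sketch. It assumes that a column's entropy is the Shannon entropy (in bits) of its values normalized to sum to one; the page does not spell out the exact formula, so treat that normalization, and the names `column_entropy` and `select_columns`, as assumptions made for this example.

```python
import math


def column_entropy(column):
    """Shannon entropy (bits) of a column, treating its values as a
    distribution after normalizing them to sum to 1 (an assumption)."""
    total = sum(column)
    if total == 0:
        return 0.0
    entropy = 0.0
    for value in column:
        if value > 0:
            p = value / total
            entropy -= p * math.log2(p)
    return entropy


def select_columns(matrix, retain=None, threshold=None):
    """Return the indices of the columns to keep.

    retain:    keep exactly this many highest-entropy columns
    threshold: keep every column whose entropy is at least this value
    The two options are mutually exclusive.
    """
    if retain is not None and threshold is not None:
        raise ValueError("retain and threshold may not be combined")
    num_cols = len(matrix[0]) if matrix else 0
    entropies = [
        column_entropy([row[c] for row in matrix]) for c in range(num_cols)
    ]
    if retain is not None:
        ranked = sorted(range(num_cols), key=lambda c: entropies[c], reverse=True)
        return sorted(ranked[:retain])
    if threshold is not None:
        return [c for c in range(num_cols) if entropies[c] >= threshold]
    return list(range(num_cols))


# Column 0 is spread evenly over four rows; columns 1 and 2 are each
# concentrated in a single row.
matrix = [
    [1, 4, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 5],
]
print(select_columns(matrix, retain=1))
print(select_columns(matrix, threshold=1.0))
```

A fixed `retain` count gives a predictable vector length, while a `threshold` lets the number of surviving columns vary with the corpus.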

For more information on HAL, the following paper is the source of this algorithm:

  • Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28, 203-208. Available [here](http://locutus.ucr.edu/reprintPDFs/lb96brmic.pdf).

See [here](http://locutus.ucr.edu/Reprints.html) for additional papers that use HAL.

Running HAL

We provide the following options for changing the behavior of HAL.

  • -s, --windowSize=INT how many words to consider in each direction
  • -r, --retain=INT how many column dimensions to retain in the final word co-occurrence matrix. The retained columns will be those that provide the most information for distinguishing the semantics of the words. Unlike the --threshold option, this specifies a hard limit for how many to retain. This option may not be specified at the same time as --threshold.
  • -h, --threshold=DOUBLE the minimum information theoretic entropy a word must have to be retained in the final word co-occurrence matrix. This option may not be used at the same time as --retain.
  • -W, --weighting=CLASS the fully qualified name of a WeightingFunction implementation. HAL traditionally uses a ramped, linear weighting in which the words occurring closest to the focus word receive the most weight, with a linear decrease based on distance.
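
To make the effect of the --weighting option concrete, the sketch below contrasts a ramped, linear weighting with a flat one. It is written in Python purely for illustration and does not reproduce the WeightingFunction interface; the function names and the `(distance, window_size)` signature are hypothetical.

```python
def linear_ramp(distance, window_size):
    """Ramped, linear weighting: the nearest neighbor (distance 1) gets
    window_size, and the weight drops by 1 for each extra position."""
    if not 1 <= distance <= window_size:
        return 0.0
    return float(window_size - distance + 1)


def even_weighting(distance, window_size):
    """Alternative for comparison: every word in the window counts equally."""
    return 1.0 if 1 <= distance <= window_size else 0.0


window = 5
print([linear_ramp(d, window) for d in range(1, window + 1)])
# [5.0, 4.0, 3.0, 2.0, 1.0]
print([even_weighting(d, window) for d in range(1, window + 1)])
# [1.0, 1.0, 1.0, 1.0, 1.0]
```

Swapping the weighting changes how sharply influence falls off with distance, while the window size still controls how far co-occurrence reaches.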